0% found this document useful (0 votes)

44 views

Managing Linux Servers With LLM-based AI Agents: An Empirical Evaluation With GPT4

Managing Linux servers with LLM-based AI agents: An empirical evaluation with GPT4

Uploaded by

traviss9010

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

44 views

Managing Linux Servers With LLM-based AI Agents: An Empirical Evaluation With GPT4

Managing Linux servers with LLM-based AI agents: An empirical evaluation with GPT4

Uploaded by

traviss9010

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

Machine Learning with Applications 17 (2024) 100570

Contents lists available at ScienceDirect

Machine Learning with Applications

journal homepage: www.elsevier.com/locate/mlwa

Managing Linux servers with LLM-based AI agents: An empirical evaluation

with GPT4
Charles Cao a ,∗, Feiyi Wang b , Lisa Lindley a , Zejiang Wang b
a
University of Tennessee, Knoxville, United States
b
Oak Ridge National Laboratory, United States

ARTICLE INFO ABSTRACT

Keywords: This paper presents an empirical study on the application of Large Language Model (LLM)-based AI agents for
LLM automating server management tasks in Linux environments. We aim to evaluate the effectiveness, efficiency,
GPT4 and adaptability of LLM-based AI agents in handling a wide range of server management tasks, and to identify
AI agent
the potential benefits and challenges of employing such agents in real-world scenarios. We present an empirical
Server management
study where a GPT-based AI agent autonomously executes 150 unique tasks across 9 categories, ranging from
Linux
file management to editing to program compilations. The agent operates in a Dockerized Linux sandbox,
interpreting task descriptions and generating appropriate commands or scripts. Our findings reveal the agent’s
proficiency in executing tasks autonomously and adapting to feedback, demonstrating the potential of LLMs in
simplifying complex server management for users with varying technical expertise. This study contributes to
the understanding of LLM applications in server management scenarios, and paves the foundation for future
research in this domain.

1. Introduction agents can perform basic reasoning, especially with methods such as
chain-of-thought prompting (Liu, Yuan, et al., 2023).
The automation of server management tasks, particularly in Linux However, there have been few studies on whether AI agents
environments, has attracted great interest in recent years, driven by ad- can autonomously perform server management tasks due to the fact
vancements in AI and machine learning (Chen, Pu, et al., 2023; Himeur most studies assume human intervention to query LLM agents for
et al., 2023; Li, Hao, Zhai, & Qian, 2023). The complexity of these tasks answers instead of asking the LLM agent to work out the problem
ranges from routine file management to system-level configurations and autonomously (Liu, Yuan, et al., 2023). In this paper, our work is en-
troubleshooting, and may require extensive training of personnel to abled by the combination of recently developed GPTs APIs announced
complete. Traditional methods of handling these tasks often involve by OpenAI with a Linux-based sandbox environment. Based on this
steep learning curves and significant time and resources for operation combination, we are able to perform an empirical study to understand
planning and deployment (Adelstein & Lubanovic, 2007; Fox, 2021; Fox how well LLM-based AI agents can autonomously carry out server
& Hao, 2017). Hence, with the emergence of AI, recent interest has management tasks in Linux environments without human intervention.
shifted to applying AI to automate these steps and to streamline the
Specifically, the user can assign task descriptions to such agents, which
workflows (Fui-Hoon Nah, Zheng, Cai, Siau, & Chen, 2023).
will then interpret and execute server management tasks by choosing
In this paper, we consider the specific challenge of server manage-
from their knowledge of Linux commands and software tools, and by
ment tasks using Linux commands and Bash programming. With the
coming up with Bash scripts, to complete the assigned tasks. Hence, the
help of Large Language Model (LLM)-based AI agents, we carry out the
user only needs to describe the complex server management operations
first empirical study on using an AI agent to autonomously complete
instead of carrying out any programming or experiments.
such tasks of varying degrees of complexity (Drath, 2021; Ward, 2021).
Our empirical study encompasses 9 broad categories of tasks with
Unlike conventional machine learning methods, LLM-based agents are
already trained on a huge amount of language data (Brown et al., a total of 50 task descriptions. For each task, there are three sub-tasks,
2020; Liu, Yuan, et al., 2023), and do not require problem-specific each with different levels of difficulty, which may build on top of each
training due to their zero-shot learning and problem-solving capabili- other. The total number of unique tasks we evaluate is 150. We hope
ties (Liu, Yuan, et al., 2023). Recent studies also demonstrate that LLM that such a wide range of tasks, derived from various sources such as

∗ Corresponding author.
E-mail addresses: [email protected] (C. Cao), [email protected] (F. Wang), [email protected] (L. Lindley), [email protected] (Z. Wang).

https://doi.org/10.1016/j.mlwa.2024.100570
Received 23 February 2024; Received in revised form 25 May 2024; Accepted 30 June 2024
Available online 1 July 2024
2666-8270/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

technical manuals, online forums (e.g., Stack Overflow and Superuser, files, modifying permissions, and handling file archiving and
among others), and server management mailing lists, can serve as a compression.
good benchmark for future studies in this area. • Bash Scripting Tasks: Focusing on the AI agent’s ability to
The objectives of this study are: generate Bash scripts for automating server management tasks,
particularly those requiring advanced scripting capabilities like
1. To evaluate the effectiveness of the AI agent in successfully
conditional logic and loops.
completing diverse server management tasks.
• Process Monitoring Tasks: Involving the monitoring and man-
2. To assess the efficiency of the AI agent in terms of the number
agement of system processes, including process listing, user pro-
of attempts and execution time required for task completion.
cess management, and service status management.
3. To analyze the adaptability of the AI agent in learning from its
• Network Monitoring Tasks: Assessing the AI agent’s proficiency
mistakes and refining its strategies based on feedback.
in managing network interfaces, monitoring network traffic, and
4. To identify the strengths and limitations of employing LLM-
handling network configuration and testing.
based AI agents for server management in real-world scenarios.
• Security Monitoring Tasks: Evaluating the agent’s ability to
manage system security, including vulnerability assessment, net-
1.1. Task selection and impact
work security monitoring, access control management, and data
The tasks presented in this study were carefully selected to cover security.
a wide range of server management activities commonly performed by • User-Specific Management Tasks: Focusing on managing user
system administrators. Our goal was to create a diverse and represen- accounts and permissions, including reviewing user accounts,
tative set of tasks that would thoroughly test the capabilities of the managing permissions and access control, and monitoring user
LLM-based AI agent in automating server management processes. activities.
The selection process involved the following steps: • Containerization Tasks: Involving the management of virtual
machines and containers, including resource monitoring, con-
1. We conducted a comprehensive review of server management tainer management, and security.
literature, including technical manuals, best practice guides, • Software Installation and Management Tasks: Assessing the
and online forums (e.g., Stack Overflow, Server Fault) to iden- agent’s proficiency in installing and managing software packages,
tify the most common and critical tasks performed by system handling package dependencies, and ensuring software version
administrators. control.
2. We consulted with a panel of experienced system administrators • Programming and Scripting Tasks: Evaluating the AI agent’s
and IT professionals to validate the relevance and importance of capabilities in programming and development, particularly in
the identified tasks in real-world server management scenarios. languages like C/C++, Python, Java, and LATEX, as well as Makefile
Their feedback helped us refine the task list and ensure its automation.
practicality.
3. We categorized the tasks into nine broad categories. This cate- To our pleasant surprise, our results demonstrate that the LLM AI
gorization helps to organize the tasks and highlight the diverse agent exhibits remarkable proficiency in autonomously navigating and
skill sets required for effective server management. executing these tasks, sometimes with remarkable ingenious planning
4. Within each category, we designed tasks with varying levels of and execution. The agent’s ability to understand complex instructions,
complexity, ranging from basic operations to more advanced interact with the Linux environment, and adapt its strategies based on
and specialized tasks. This approach allows us to assess the AI real-time feedback promises a significant advancement in automated
agent’s ability to handle tasks of different difficulty levels and to server management.
evaluate its learning and adaptation capabilities. The advantages of employing an LLM AI agent in Linux server man-
agement are manifold. First, the agent’s natural language processing
The selected tasks have significant impacts on various aspects of capabilities significantly lower the barrier to entry, enabling users with
server management, including: varying levels of technical expertise to manage complex server systems
effectively. Second, the agent’s autonomous task execution and code
• Efficiency: Automating repetitive and time-consuming tasks, such
generation capabilities greatly reduce the likelihood of human error.
as file management and software installation, can greatly re-
Finally, its adaptability and learning capabilities mean that it can con-
duce the workload of system administrators and improve overall
tinuously improve its ability to handle increasingly complex and novel
efficiency.
tasks over time, especially with the use of in-context learning (Min
• Reliability: Consistent and accurate execution of critical tasks, et al., 2022; Rubin, Herzig, & Berant, 2021).
such as security monitoring and user management, ensures the In our preliminary experiments, we use GPT -4 as the backend for
stability and reliability of server systems. GPTs development. We also believe that the results are generalizable
• Scalability: The ability to handle complex tasks and adapt to to other LLMs, especially the open-source ones (Naveed et al., 2023;
different server environments enables the AI agent to scale effec- Touvron et al., 2023). We emphasize that our selection of Linux tasks
tively and manage large-scale server infrastructures. is by no means comprehensive. However, the key contribution of this
By carefully selecting and categorizing the tasks, we aim to provide paper is to demonstrate the feasibility of using LLMs for Linux server
a comprehensive evaluation of the LLM-based AI agent’s capabilities in management tasks, which we hope will inspire further research in this
server management automation. The insights gained from this study domain. In the remainder of this paper, we first describe the design
can contribute to the development of more advanced and efficient and methodology behind the LLM AI agent’s task execution capabilities,
server management tools and practices. including the setup of the Linux sandbox environment for testing. We
Specifically, we evaluate the following tasks, reflecting the multi- present the experiment results, including detailed case studies in server
faceted nature of Linux server management. The agent can also learn management scenarios. We conclude with a discussion of related work
from its mistakes and improve its performance over time. Some tasks and a summary of our findings and contributions to the domain of
may require several steps instead of a single step to finish. The AI agent AI-driven automated server management.
may decide to use Bash scripts to complete the tasks as it sees fit. The
tasks are as follows: 2. Related work

• File and Directory Management Tasks: Evaluating tasks like Large Language Models (LLMs), such as those developed by Ope-
creating and editing files, listing and searching files, renaming nAI (Brown et al., 2020; Liu, Yuan, et al., 2023), have significantly

2
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

Fig. 1. Overview of framework design.

expanded their application scope beyond traditional data analytics to architecture. It has three components: the GPTs frontend for receiving
system administration and automation. The use of LLMs in interpret- user inputs for tasks and providing feedback, the server in the middle
ing and executing complex server management tasks through natural based on Node.js, and the Linux sandbox environment that interacts
language processing marks a pivotal shift in how server environments with the server. The GPTs frontend is responsible for interpreting user
are managed. Prior work has demonstrated the potential of LLMs in inputs and generating Linux commands or scripts. The Node.js server
automating routine tasks and providing decision support in IT environ- backend acts as the communication hub between the frontend and
ments (Jiang, Schoop, Swearngin, & Nichols, 2023; Wen et al., 2023; the Linux sandbox environment. The Linux sandbox environment is a
Yan et al., 2023; Zhan & Zhang, 2023). However, the application of Dockerized setup that replicates real-world Linux server configurations.
these models in direct interaction with server systems, particularly in This environment is crucial for safely executing commands and scripts
Linux environments, is still an emerging field, with significant potential generated by the AI agent.
for growth and innovation.
The field of AI-driven automation in server management has seen 3.1. GPTs frontend
considerable advancements, with a focus on enhancing efficiency, re-
ducing human error, and simplifying complex operations. Studies have The GPTs frontend serves as the primary interface for users to inter-
explored the use of AI in network management (Chen, Li, et al., 2023; act with the system. To ensure that all tasks are processed repeatably,
Yuan et al., 2023), security protocols (Bertino et al., 2021; Hu et al., we choose not to provide task descriptions to the frontend directly.
2021), and system monitoring (Costa, Oliveira, Pinto, & Tavares, 2019; Instead, all task descriptions are included in the README file, and the
Gao et al., 2019). These developments underscore the potential of AI AI agent is tasked with reading from this file using the cat command.
agents in automating a wide range of server management tasks, from For example, a task description typically reads as follows:
basic file operations to intricate network configurations and security
management. The integration of AI in these areas not only streamlines
operations but also introduces new capabilities, such as predictive Please read from the README file in the
maintenance and anomaly detection. ↪ task1 directory , and complete the task
However, existing work often involves human operators in the loop. ↪ by using Linux commands or scripts .
↪ At the end of the task , use the verify
More specifically, these studies usually require human operators to
↪ .py to check whether the task is
query the AI agents for answers, instead of asking the AI agents to
↪ completed successfully . If not , please
work out the problem autonomously. For example, AgentBench (Liu, ↪ revise your strategy and try again.
Yu, et al., 2023) evaluates the performance of LLMs by repeatedly ↪ You can try up to 5 times.
querying its interface without actually performing actions by the agent.
In contrast, our work focuses on the use of LLM-based AI agents to
autonomously execute server management tasks, without human inter- 3.2. Node.js server backend
vention. This approach not only reduces the need for human operators
but also enables the AI agent to learn from its mistakes and improve its The Node.js server backend acts as the communication hub between
performance over time. the GPTs frontend and the Linux sandbox environment. It is responsible
As our work demonstrates the feasibility of using AI agents to for relaying commands, scripts, and output . It has the following key
perform autonomous tasks, this could have a significant impact on modules:
future designs of operating systems, through the integration of LLMs
with Linux server environments. This integration leverages the natural • API Gateway: All GPTs output is generated in the form of OpenAPI
language processing capabilities of LLMs to manage and automate tasks JSON format. The API gateway module receives these OpenAPI
in a Linux setting. This approach is particularly beneficial for users messages from the GPTs frontend and converts them into com-
with varying levels of technical expertise, significantly lowering the mands executable in the Linux environment. It also transfers
barrier to managing complex server systems. Furthermore, the agent’s output results from Linux to the frontend following OpenAPI
autonomous task execution and code generation capabilities greatly standards.
reduce the likelihood of human error. Finally, its adaptability and • Task Management: This module manages a queue of tasks, as
learning capabilities mean that it can continuously improve its ability to some Linux operations may take longer to complete than others.
handle increasingly complex and novel tasks over time, especially with This ensures that multiple commands generated by the frontend,
the use of in-context learning (Fan et al., 2023; Rubin et al., 2021). they are executed in an orderly and efficient manner. No single
command gets executed more than once. We achieve this by
3. Design assigning a unique sequence number to each command.
• Error Handling: This module interprets error messages from the
We first present the design of our framework, including how the Linux environment and relays them back to the frontend for fur-
GPT-based AI agent interacts with a Linux server environment for ther processing. It also handles special situations such as invalid
automating server management tasks. Fig. 1 illustrates the overall API calls, and provides feedback to the frontend.

3
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

3.3. Linux sandbox environment initially successful, the AI agent engages in iterative improvement by
revising its strategy and refining its approach based on the feedback
To evaluate whether the AI agent is able to perform tasks well, and received, with a maximum of five tries allowed. This iterative process
for security concerns, we decide not to run the tasks in a native Linux is key to enhancing the system’s overall effectiveness and reliability.
console. Instead, we implement a Linux sandbox environment, which is
based on a Dockerized setup designed to mirror real-world Linux server 3.6. Security considerations
configurations. This specialized environment plays a pivotal role in the
safe execution of commands and scripts generated by AI agents without As our system involves interaction with OpenAI’s GPT models, we
causing interference to normal system operations. have taken several measures to ensure the security and confidentiality
One of the key features of this environment is its focus on isola- of the data processed by the AI agent. To mitigate the risk of exposing
tion and security. By executing each task within a separate Docker sensitive information, we have implemented a multi-layered security
container, it ensures that operations are repeatable, isolated, and sig- approach.
nificantly reduces the risk of security breaches. Since this Dockerized Firstly, our Linux sandbox environment is completely isolated from
environment is equipped with a diverse array of Linux tools and utili- production systems and does not contain any real user data, sensitive
ties, the AI agent can carry out a wide spectrum of server management files, or secret credentials. The tasks and their associated data were
tasks. This versatility makes it an invaluable tool for managing complex specifically created for the purpose of this study and do not reflect any
server environments and ensures that a broad range of tasks can be real-world confidential information.
handled efficiently and effectively. Secondly, we have introduced a control layer between the AI agent’s
During our evaluations, real-time feedback on the execution of generated commands and the OpenAI interface. This control layer acts
tasks is provided back to the AI agent. This includes detailed error as a security gateway, filtering and sanitizing the commands before
messages and execution logs, which are crucial for troubleshooting they are executed in the sandbox environment. It checks for any
and understanding the outcomes of executed commands and scripts. potential attempts to access or manipulate files and directories outside
The immediate availability of this feedback is instrumental in ensur- the designated task directories and scans for suspicious patterns that
ing better decision-making in managing server operations over time might indicate the presence of sensitive information. If any potential
by the AI agent, such as through its well-known in-context learning security risks or unauthorized access attempts are detected, the control
capabilities (Ye et al., 2023). layer immediately blocks the execution of the command and sends an
alert to the system administrators.
3.4. Task execution and verification Furthermore, all communication between the AI agent and the
OpenAI interface is secured using encryption protocols to prevent any
In our system, each task has its own README file, which includes potential data leakage during transmission. We also regularly monitor
the detailed description, as well as a verification script whenever the logs and audit the system to detect any anomalies or suspicious
applicable, written in Python. The task description provides a clear activities.
and concise explanation of what needs to be done, guiding the AI It is important to note that while these security measures have
agent in the analysis of the requirements of each task. The verification been implemented in our experimental setup, real-world deployments
tool, written as verify.py, is tailored for each task, and designed to should consider additional security best practices, such as proper access
check whether the task has been completed successfully. Both the control, data encryption, and regular security audits, to ensure the
README file and the verification script are essential for evaluating the protection of sensitive information.
outcomes of the AI agent’s task execution. This tool provides objective
measures and feedback if errors are encountered. Furthermore, the AI 4. Implementation
agent employs an iterative learning approach, utilizing feedback from
unsuccessful attempts to refine its strategies. This process enhances In this section, we describe the implementation details. In our
the agent’s accuracy and efficiency over time, allowing for continuous approach to customizing the GPT platform, we developed a GPTs
improvement in task execution. extension, akin to those recently announced in the GPT Store by
We clarify that not all tasks have accompanying verification scripts. OpenAI (Roumeliotis & Tselikas, 2023). It is important to note that such
For example, in some tasks, the AI agent is instructed to install a certain an extension is also feasible for open-source Large Language Models
software package with a specific version number. In such cases, we (LLMs) through prompt engineering and customization techniques (Liu
simply verify the outcome of the installation directly based on feedback & Chilton, 2022; Zhou et al., 2022). During the implementation phase,
instead of using a verification script. we crafted a detailed OpenAPI schema. The beginning of this schema
in GPTs is configured as follows:
3.5. Overall workflow

The overall workflow is as follows:. For each task, the AI agent is { " openapi ": " 3.1.0 ",
instructed to use appropriate Linux commands to read the README file "info": {
in the corresponding directory (e.g., task1, task2, etc.) and interpret the " title ": " Linux Command Execution Service "
requirements. It then autonomously determines the appropriate Linux ↪ ,
commands or scripts to solve the particular problem. Interestingly, our " description ": " Executes Linux commands or
observation is that the AI agent relies heavily on pipes to connect ↪ scripts in a sandbox environment
↪ and returns the results .",
multiple Linux commands together, and almost always refrains from
" version ": "v1 .0.0"
using Bash scripts unless necessary (i.e., when instructed to do so or
}
when the task is too complex). The AI agent then sends the command }
flow to the Node.js server backend, which relays the commands to the
Linux sandbox environment for execution. For each command, there is
a unique sequence number assigned to it, so that individual commands The rest of this schema defines multiple API endpoints, including
are never executed more than once. Once a task is executed, its outcome their parameters and response structures. For instance, invoking the
is received by the AI agent. Finally, the AI agent is instructed to use the ls command with various options allows the agent to retrieve a list
verification tool to get a PASS/FAIL score. In cases where the task is not of files in a specified directory. In more complex tasks, the AI agent

4
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

was instructed to use Bash scripts for completion, highlighting the documenting the AI agent’s attempts, successes, and failures for that
adaptability of this approach. specific task.
For the backend, we developed a Node.js-based server to facilitate In summary , our experimental setup offers a comprehensive and
data exchange with the frontend. This server handles task executions secure framework for assessing the LLM AI agent’s proficiency in Linux
by keeping track of their state information. The server offers a RESTful server management tasks. By combining a realistic Linux environment
interface, enabling the processing of queries from the frontend. All with structured tasks and a verification system, we can effectively
queries and responses are implemented in the JSON format. Further- evaluate the agent’s capabilities, learning efficiency, and adaptability
more, JSON’s human-readable and machine-parsable structure makes in scenarios that emulate real-world server management requirements.
it an ideal choice for our system’s data exchange requirements.
To accommodate various types of tasks, we developed a Linux 5.2. Justification of approach
terminal sandbox, utilizing Docker images. This sandbox environment
is necessary to ensure that the AI agent’s interactions with back- Our experimental setup is inspired by the principles of controlled
end simulators are confined to a controlled and secure environment, experimentation and the use of sandbox environments for testing and
thereby mitigating potential security risks associated with direct file evaluation in software engineering and system administration (Ju-
system-level access. Further, it prevents consecutive commands from risto & Moreno, 2013). The use of sandbox environments is a well-
interfering with each other, so that the underlying Linux environment established practice in the field of cybersecurity and system testing, as
is essentially stateless from the perspective of the AI agent. it allows for the safe execution of potentially harmful or unstable code
Communication between the Node.js server and the Dockerized without affecting the main system (Greamo & Ghosh, 2011).
sandbox is facilitated through socket programming. The controller for The organization of tasks into specific directories and the use of
the Dockerized Linux sandbox is written in Python, which oversees all README files for instructions is a common practice in software de-
communication with the Node.js server, and also provides statistics and velopment and testing (Prana, Treude, Thung, Atapattu, & Lo, 2019).
logs at the end of each experiment session. This approach helps to maintain a clear structure and provides a
standardized way of communicating task requirements and expected
5. Experiments outcomes to the AI agent.
The use of verification scripts to assess task completion is similar to
the concept of test oracles in software testing (Barr, Harman, McMinn,
In this section, we systematically describe the experiment results.
Shahbaz, & Yoo, 2014). Test oracles are used to determine whether a
task or test case has been completed successfully based on predefined
5.1. Experiment setup
criteria. In our study, we employ verification scripts as a means to
evaluate the AI agent’s performance and provide feedback for learning
Our setup is designed to closely simulate a real-world server en-
and improvement.
vironment while maintaining a controlled and secure testing frame-
The iterative process of executing tasks, receiving feedback, and
work. The core components of our experimental setup include a Linux
refining the AI agent’s approach is inspired by the principles of re-
sandbox environment, task-specific directories with instructions, and
inforcement learning (Sutton & Barto, 2018) and the use of feedback
verification scripts to assess task completion.
loops. By repeatedly executing tasks and adapting based on the feed-
There are two criteria for choosing a particular task. First, it
back received, the AI agent can learn from its mistakes and improve its
must be well-defined, meaning that it has a clear task description and
performance over time.
input/output. Second, we prefer tasks that are verifiable, meaning that
The logging of interactions and maintaining task-specific log files
the task can be either verified by a script based on the output, or by
is a standard practice in system administration and software test-
the operator by checking the feedback received. For tasks verifiable by
ing (Oliner, Ganapathi, & Xu, 2012). Logging helps to keep track of
a script, we developed such a verification script in Python to check
the AI agent’s actions, identify issues, and facilitate debugging and
whether the task has been completed successfully. The AI agent will
analysis.
use the feedback from the verification script to refine its approach and
improve its performance over time. 5.3. File and directory tasks
Our experimental framework organizes tasks into specific direc-
tories within the Linux sandbox environment. These directories are In this category, we ask the AI agent in executing a variety of file
categorized based on the nature of the server management tasks. Each system operations. These tasks are integral to evaluating the agent’s
directory includes a README.md file containing detailed instructions practical application in managing real-world server environments. Our
about the task and the expected outcomes. The AI agent interprets aim is to assess not just the agent’s accuracy in command execution but
these instructions using its natural language processing capabilities to also its capability to handle tasks of escalating complexity and to learn
understand and execute the tasks. The generated Linux commands and from iterative attempts.
scripts are then executed within the sandbox environment. The agent’s Table 1 shows the list of tasks of interest. The file system oper-
ability to accurately interpret and execute these tasks is central to our ations are categorized into 10 distinct tasks, each with its own set
evaluation of its effectiveness in server management. of subtasks that increase in complexity. This allows for a complete
At the end of each task, if the task directory contains a verify.py assessment of the AI agent’s capabilities, from executing straightfor-
Python script, it is executed . This script checks whether the task ward file manipulations to tackling more sophisticated file management
has been completed successfully according to predefined criteria. In challenges.
cases where a task is not completed successfully, the verification script In this table, we also include the number of attempts required to
generates error messages, which are then provided to the AI agent. This complete each task, using ✓for correct try and ✗for incorrect try. The
feedback mechanism is pivotal for the agent’s learning and improve- agent’s learning curve is analyzed to determine its ability to adapt its
ment in subsequent experiment rounds. This adaptive execution process strategy based on real-time feedback.
is repeated for a predefined number of tries, ensuring that the agent Based on the results, we find that the AI agent is successful in com-
learns from its errors and enhances its performance over time. pleting all tasks that we have assigned in this category. For example,
For analysis and review purposes, all interactions of the AI agent in Task 1.1 (Table 1), the AI agent was tasked with creating a text file
with the Linux sandbox are saved in logs. These logs include details named ‘‘output.txt’’ and adding the string ‘‘hello, world’’ with a line
of the commands executed, outputs received, and error messages en- break. The agent successfully executed this task on the first try using
countered. Additionally, each task directory maintains its own log file, the command as follows:

5
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

Table 1
File and directory management tasks.
File Creation and Editing
1.1 Create ‘‘output.txt’’ with content ‘‘hello, world’’. ✓
1.2 Append ‘‘hello world’’ 100 times to the file separated by line breaks. ✗✗✓
1.3 Append integers 1 to 1000 to the file separated by spaces. ✗✓
File and Directory Listing
2.1 List files in the user’s home directory. ✓
2.2 Generate detailed list of contents in a specified directory (e.g., /var). ✓
2.3 Recursively list all files, including hidden ones. ✓
File Searching
3.1 Search files in a directory containing a keyword. ✓
3.2 Search files case-insensitive. ✓
3.3 Find occurrences of an ip address in Apache logs and extract relevant lines. ✗✓
File Renaming
4.1 Sequentially rename .txt files in a directory. ✓
4.2 Rename .jpg files with current date prefix. ✓
4.3 Rename files using keyword matching. ✓
Permission Modification
5.1 Make a file read-only for all users. ✓
5.2 Recursively change permissions in a directory to r/w by owner only. ✓
5.3 Make all .sh files in current directory executable by the owner. ✓
File Archiving and Compression
6.1 Create a zip file with all .txt files in current directory and extract it later. ✓
6.2 Create a .tar.gz file excluding .log files and extract it later. ✓
6.3 Compress ‘‘media’’ directory into ‘‘media.zip’’ and extract it later. ✓
File Synchronization
7.1 Synchronize ‘‘source’’ to ‘‘dest’’ with rsync, mirroring source. ✓
7.2 Sync ‘‘source’’ to ‘‘dest’’, excluding .log files and delete extra files in ‘‘dest’’. ✓
7.3 Sync ‘‘source’’ to multiple directories, excluding .log files. ✗✗✓
File Conversion and Encoding
8.1 Find and convert .jpeg to .png in a directory recursively and save in a directory. ✓
8.2 Find and convert .jpeg to .png in a directory including zip files, and save in a directory. ✓
8.3 Convert .txt file encoding from ASCII to UTF-8 in a directory recursively. ✓
File Organization
9.1 Copy files into folders based on types. ✗✓
9.2 Copy files into folders based on modification dates. ✗✗✗✓
9.3 Organize files by size into appropriately sized bins and choose bins adaptively. ✗✓
Automated Data Backup and Cleanup
10.1 Automated data backup with conditional logic based on the date. ✗✓
10.2 Use rsync, zip, and log operations for data backup activities. ✗✓
10.3 Bash script to clean up a directory by removing specific files that contains the keyword ‘‘ERROR’’. ✓

echo -e "hello , world \n" > output .txt cd ./ task4 /3/ dest && ls report_ *. pdf | sed
↪ -e ’p;s/ report_ \$[0 -9]*\$ / summary_
↪ \\1/ ’ | xargs -n2 mv
Task 1.2 (Table 1) presented a more complex challenge, requiring
the agent to write ‘‘hello world’’ 100 times in ‘‘output.txt’’. The agent
initially struggled but adapted its strategy, eventually succeeding on In Task 6.2 (Table 1), the agent works with file compression
the third attempt with a loop command. and decompression tasks. It skillfully created a tar.gz archive named
‘‘project.tar.gz’’ from the current directory contents, excluding all .log
files, and then decompressed it into a new directory. The command
for i in $(seq 1 100); do echo ’hello world ’ used was:
↪ >> output .txt; done

cd ./ task6 /2 && tar -cvzf project .tar.gz --

An interesting note is that we find that for many tasks, the AI agent ↪ exclude =’*. log ’ project && mkdir
prefers to use commands instead of Bash scripts to complete the task ↪ project_extracted && tar -xvzf
as much as possible. When it uses Bash scripts, it is usually because the ↪ project .tar.gz -C project_extracted
task requires complex conditional logic or loops. This task highlighted
the agent’s ability to handle iterative text processing and learn from
previous errors. It is worthwhile to point out the AI agent is skilled enough to come
In Task 4.3 (Table 1), the AI agent demonstrated its advanced up with such commands in the first try.
file renaming capabilities. The task required renaming files in the Task 7.3 (Table 1) required synchronizing files from the ‘‘source’’
‘‘dest’’ directory by replacing the pattern such as report001 with directory to three different directories, excluding .log files. The agent
summary001. The agent successfully used a combination of ls, sed, used rsync effectively for this multi-destination synchronization:
and xargs commands:

6
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

Table 2
cd ./ task7 /3 && rsync -av --delete -- Bash scripting tasks.
↪ exclude =’*. log ’ source / dest1 / && Variable and Parameter Handling
↪ rsync -av --delete --exclude =’*. log ’ 1.1 Advanced variable assignments and manipulations. ✓
↪ source / dest2 / && rsync -av --delete 1.2 Handle positional parameters in scripts. ✓
↪ --exclude =’*. log ’ source / dest3 / 1.3 Script for parsing and processing command-line arguments. ✗✗✓
Conditional Constructs and Flow Control
2.1 Implement if-else and case statements for decision-making. ✓
This command initially failed twice due to incorrect parameters, 2.2 Utilize loops (for, while, until) for repetitive tasks. ✓
but succeeded in the third try. This showcased the agent’s advanced 2.3 Develop complex flow control logic in scripts. ✓
understanding of context and learning from errors. Arithmetic and String Operations
Task 9.3 (Table 1) involved organizing files by size into different 3.1 Perform arithmetic calculations and comparisons. ✓
bins. The agent used a combination of find, awk, and xargs to categorize 3.2 String manipulation and pattern matching. ✓
files based on size: 3.3 Script for regular expression-based text processing. ✗✗✗✓
Function Definition and Modular Scripting
4.1 Create and use functions for modular scripting. ✓
cd ./ task9 /3 && find /usr/bin -type f - 4.2 Pass parameters and return values in functions. ✓
↪ printf ’%s %p\n’ | awk ’{if ($1 4.3 Develop scripts with reusable code modules. ✓
↪ <=10000) print $2 " tiny "; else if ($1 Advanced Scripting Techniques
↪ <=100000) print $2 " small "; else if (
5.1 Implement recursion in scripts. ✓
↪ $1 <=1000000) print $2 " medium "; else
5.2 Use complex data structures like arrays. ✓
↪ if ($1 <=10000000) print $2 " large "; 5.3 Integrate Bash with other languages (Python, awk). ✗✓
↪ else print $2 " x- large ";} ’ | xargs -
↪ n2 cp -t

previous section, we did not include any script operations specifically

Finally, Task 10.3 (Table 1) presented a complex scripting challenge
for directories and files in this section.
where the agent had to write a Bash script to clean up a directory by
Based on the results, we observe that for those unsuccessful tries, the
removing specific files based on their content. The agent successfully
reasons are usually pretty straightforward such as formatting issues or
executed the cleanup_script.sh, which was designed to identify and
incorrect orderings of commands. For example, in Task 3.3 (Table 2),
remove files containing errors:
the AI agent was tasked with writing a script to process a text file using
regular expressions. The agent initially struggled with the syntax of
cd ./ task10 /3 && chmod +x cleanup_script .sh understanding the problem statement, but eventually succeeded on the
↪ && ./ cleanup_script .sh fourth try.

Following is the script generated by the AI agent without any 5.5. Process monitoring tasks
changes.
In this section, we evaluate the AI agent’s ability in monitoring
and managing system processes. The results are shown in Table 3.
#!/ bin/bash These tasks are designed based on their resource usage and priority.
# Initialize file count For example, if a process requires too much CPU resource, we may
COUNT =0
want to terminate it for further inspection. The agent’s performance
# Scan and delete .txt files containing ’
↪ ERROR ’ is evaluated based on its ability to accurately identify processes with
for file in logs /*. txt; do high resource usage and to take appropriate actions to manage them.
if grep -q ’ERROR ’ " $file "; then We notice the AI agent is smart enough to double-check in some
rm "$file " cases for the results. For example, in Task 2.3 (Table 3), it uses the
(( COUNT ++)) following three commands:
fi
done
# Log the file deletion count
ps -u $( whoami ) -o pid ,% cpu --sort =-% cpu |
echo "Files deleted : $COUNT " >>
↪ head -n 2
↪ cleanup_summary .log
kill 139411
ps -u $( whoami ) -o pid ,% cpu --sort =-% cpu"
This task underscored the agent’s proficiency in scripting for au-
tomation and its adaptability in executing intricate server management
Clearly the ps is running another time at the end to ensure the
tasks.
process is indeed killed.
In Task 4.1 (Table 3), the agent chooses to combine the script
5.4. Bash scripting tasks
generation and the invocation in a single line of code:
In this section, we evaluate the ability of the AI agent to generate
Bash scripts for automating server management tasks. The results are
echo ’#!/ bin/bash ’ > maintenance_script .sh
shown in Table 2. We focus on the unique syntax in Bash program-
↪ && echo ’./ service_one && ./
ming, including conditional logic, loops, and command-line argument ↪ service_one ’ >> maintenance_script .sh
parsing. These tasks are designed to assess the agent’s proficiency in ↪ && chmod +x maintenance_script .sh &&
generating scripts by taking advantage of its language features for ↪ ./ maintenance_script .sh
complex server management tasks. To avoid overlapping with the

7
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

5.6. Network monitoring tasks tasks are designed to identify and manage user accounts based on their
permissions and access levels. As usual, the AI agent applies its vast
In this section, we evaluate the AI agent’s ability in monitoring and knowledge on security to user account management. For instance, in
managing network interfaces. The results are shown in Table 4. The Task 5.1 (Table 6), to check the password policy, the agent first exam-
agent’s performance in this category is evaluated based on its ability to ines the contents of the /etc/login.defs file and then the configuration
accurately identify network interfaces and to take appropriate actions files in the /etc/pam.d/ directory, as shown below. These files govern
to manage them. For network tasks, one challenge we met is that some various aspects of user authentication and password policies.
tasks may take much longer to complete, such as logging network
traffic statistics. In such cases, we intentionally set a longer timeout
for the agent to complete the task. To ensure success of completing the grep -E ’PASS_MAX_DAYS | PASS_MIN_DAYS |
tasks, we have pre-installed all relevant network tools in the Docker ↪ PASS_WARN_AGE | PASS_MIN_LEN ’ /etc/login
↪ .defs
environment to avoid the problem of missing packages.
Received command : cat /etc/pam.d/common -
For this experiment and later ones that involve remote servers, we
↪ password
use a remote server hosted on AWS. The agent is able to log into the
remote server using SSH.
One observation is that for network tasks, the AI agent tries multiple 5.9. Container tasks
ways to complete the task. For example, to get the IP address of the
current host, the agent correctly uses ip addr show command in the In this section, we evaluate the AI agent’s ability in managing
first try. However, if we inject an error message to this command to containers. The results are shown in Table 7. These tasks require docker
tell the agent that the ip command is not available, it also tries other images to be created and modified within the dockerized environment,
commands sequentially, including ifconfig and hostname -I. It also a functionality that sometimes referred to as ‘‘Docker-in-Docker’’. The
demonstrates a wide range of knowledge in networking. For Task 4.3 agent’s performance in this category is evaluated based on its abil-
(Table 4), it uses the following commands to check the health of a ity to accurately identify the requirements of containers and to take
remote server, if the particular command is made unavailable: ping, appropriate actions to manage them.
ssh, and netcat.
5.10. Software installation and management tasks
5.7. Security monitoring tasks
In this section, we evaluate the AI agent’s ability in installing
In this section, we evaluate the AI agent’s ability in monitoring and and managing software packages. The results are shown in Table 8.
managing system security. The results are shown in Table 5. These These tasks are designed to assess the agent’s ability to identify and
tasks are designed to assess the agent’s ability to identify and manage manage software packages based on their version and dependencies.
security vulnerabilities based on their severity and impact. The agent’s The agent’s performance in this category is evaluated based on its
performance in this category is evaluated based on its ability to accu- ability to accurately identify software packages with outdated versions
rately identify security vulnerabilities and to take appropriate actions and to take appropriate actions to manage them.
to mitigate them. Just like the networking related tasks, we observe The most challenging tasks among these are those that require the
that the agent demonstrates wide knowledge on security related tools agent to read the documentation of a software package and use its
and packages, probably due to the fact that it has been trained with a commands. For example, in Task 4.3 (Table 8), the agent was tasked
large amount of security related data. with reading the documentation of a testing software package we
For example, following is the script generated by the agent to check developed from scratch, to ensure it does not exist in the GPT’s training
user activities in the authentication log. The comments are generated data. The software package is similar to the git software and provides
by the AI agent as well. version control commands. However, these commands are entirely new
and have their own unique parameters. Through multiple rounds of
attempts, the agent successfully executed this task by figuring out how
#!/ bin/bash
to achieve the task using the software’s commands based on reading
# Path to the authentication log
the contents of the documentation for this tool autonomously.
auth_log ="/var/log/auth.log"
# Check if the log file exists
if [ ! -f " $auth_log " ]; then 5.11. Programming and scripting tasks
echo " Authentication log file not found :
↪ $auth_log " In this section, we evaluate the AI agent’s abilities in programming
exit 1 and scripting in languages other than Bash scripting . The results are
fi shown in Table 9. In particular, we focus on popular languages such as
# Generate the report C/C++, Python, Java, and LATEX. We also evaluate the agent’s ability to
echo "User Login Activities Report :" use Makefile for project management.
echo " --------------------------------" These tasks are designed to assess the agent’s ability to identify and
grep ’session opened for user ’ " $auth_log " | manage programming and scripting tasks based on their complexity and
↪ while read -r line; do
dependencies. The agent’s performance in this category is evaluated
# Extract and display the date , user , and
based on its ability to accurately identify programming and scripting
↪ other relevant information
echo "$line" | awk ’{ print $1 , $2 , $3 , " tasks with increasing complexity and to take appropriate actions to
↪ User :", $(NF -5) }’ manage them.
done We note that the ability of the LLM to understand code allows it
to perform actions that conventional compilers cannot . For example,
in Task 2.3 (Table 9), the agent was tasked with identifying the orga-
5.8. User-specific management tasks nization of a Python project and inferring its functionality . The agent
successfully executed this task by reading the contents of the Python
In this section, we evaluate the AI agent’s abilities in managing files and explaining what they does. Such functions are typically not
user accounts and permissions. The results are shown in Table 6. These available in compilers or code analyzers. Another unique aspect of the

8
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

Table 3
Process-related tasks.
Process Monitoring and Scripting
1.1 Script/command to list running processes with CPU and memory usage. ✓
1.2 Script/command to find processes with highest CPU/memory usage. ✓
1.3 Script/command to log current processes and resource usage with timestamp. ✓
User Process Management
2.1 List all active user processes, excluding system processes. ✓
2.2 Identify and reduce priority of user process with highest CPU usage. ✗✓
2.3 Identify and kill the user process with the highest CPU usage. ✓
Service Status and Management
3.1 Check if a service (service_2024) is running. ✓
3.2 Find and log the status of service_2024; restart if not running. ✓
3.3 Log status of service_2024 with timestamp; restart if not running. ✓
Task Automation and Scheduling
4.1 Write a script to automate and run ‘‘service_one’’ twice. ✓
4.2 Check and list currently scheduled cron tasks. ✓
4.3 Schedule a script to run every minute using crontab and log. ✓
Memory Usage Monitoring
5.1 Monitor VSZ memory usage of processes. ✓
5.2 Bash script to monitor VSZ memory usage; profile for 100 rounds at 1-second intervals. ✗✓
5.3 Identify top users according to VSZ memory usage of processes. ✓

Table 4
Network monitoring tasks.
Network Interface Management
1.1 List all active network interfaces and their status. ✓
1.2 Enable and disable network interfaces. ✓
1.3 Monitor a specified network interface bandwidth usage. ✓
Network Traffic Monitoring
2.1 Develop a script to capture and analyze network packets. ✓
2.2 Develop a script to monitor real-time network traffic. ✓
2.3 Develop a script to log traffic statistics. ✓
Network Configuration
3.1 Get IP address. ✓
3.2 Get DNS address. ✓
3.3 Get hostname. ✓
Remote System Management
4.1 SSH log into remote server. ✓
4.2 Synchronize files between local and remote systems. ✗✓
4.3 Check remote server health. ✓
Network Service Management
5.1 Check the status of Google. ✓
5.2 Script to automate the starting, stopping, and restarting of a network service network_service_1. ✗✗✓
5.3 Log the performance of specific network services. ✗✓

LLM agent is its ability to perform debugging tasks. For example, in the agent’s learning efficiency and adaptability in refining its
Task 5.1 (Table 9), the agent was tasked with modifying a C program strategies based on feedback.
with an injected variable bug to successfully compile it. Due to the
scope of this study, we only injected simple bugs, and leave the study These metrics will be reported for each task category in the subsequent
of more complex software repair tasks for the future. subsections, providing a comprehensive evaluation of the AI agent’s
performance across different aspects of server management.
5.12. Performance metrics
5.13. Overall performance
To evaluate the AI agent’s performance across all task categories,
we introduce the following common metrics:
We now report the overall performance of the GPT-4 AI agent in
• Success Rate: The percentage of tasks successfully completed completing the tasks. The results are shown in Fig. 2. The AI agent
by the AI agent within the given number of attempts. Here, demonstrated strong overall performance across all task categories. The
we test success rates for one-shot and five-shot attempts. This average one-shot success rate was approximately 81%, indicating the
metric provides an overall measure of the agent’s effectiveness agent’s effectiveness in handling a wide range of server management
in handling various server management tasks. tasks on the first attempt. The average five-shot success rate was 100%,
• Average Attempts: The average number of attempts required by showcasing the agent’s ability to learn and adapt based on feedback and
the AI agent to successfully complete a task. This metric indicates multiple attempts.

9
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

Table 5
Security-related tasks.
Vulnerability Assessment
1.1 Check for open ports using ‘nmap’ and report vulnerable services. ✓
1.2 Analyze ‘/var/log/auth.log’ for failed SSH login attempts. ✓
1.3 Script to list all installed packages and check for available updates. ✓
Network Security Monitoring
2.1 Use ‘tcpdump’ to capture packets for a certain period. ✓
2.2 Script to parse sample firewall logs. ✗✓
2.3 Write a Python script to parse pcap traces. ✗✗✗✓
Access Control Management
3.1 Audit file permissions in sensitive directories. ✓
3.2 Disable user accounts not used in the last 90 days. ✓
3.3 Generate a report of user login activities from ‘/var/log/auth.log’. ✗✓
Data Security and Encryption
4.1 Script to encrypt specified directories using ‘gpg’ with given keys. ✓
4.2 Automate encrypted backups using ‘rsync’ and ‘gpg’. ✓
4.3 Verify the integrity of encrypted backups using checksums. ✓
Firewall Operations
5.1 Read iptables network related rules . ✓
5.2 Add network traffic rules to the iptable. ✓
5.3 Remove network traffic rules from the iptable. ✓

Table 6
User-specific tasks.
User Account Review and Management
1.1 List user accounts with administrative privileges. ✓
1.2 Check the last login date of a user account. ✓
1.3 Disable user accounts that have been dormant for 90 days. ✓
Permission and Access Control
2.1 Find vulnerable file system permissions for sensitive data. ✓
2.2 Change permissions to sensitive directories. ✓
2.3 Log permission changes. ✓
User Activity Monitoring
3.1 Generate statistics for login activities and patterns of users. ✓
3.2 Detect unusual user activities. ✗✓
3.3 Disable users with abnormal activities. ✓
User Group Management
4.1 List user group memberships and roles. ✓
4.2 Compare group memberships to standard requirements. ✗✗✓
4.3 Validate and correct inconsistencies in group memberships. ✓
User Authentication and Security
5.1 List and audit current password policies. ✓
5.2 Check for Password Aging. ✓
5.3 Validate Password Complexity using PAM. ✗✓

Fig. 2. Overall performance evaluation.

The agent required, on average, 1.13 to 1.43 attempts to success- These results suggest that the AI agent is highly capable of automat-
fully complete a task for each category of tasks, highlighting its learning ing server management tasks and can significantly reduce the burden
efficiency and adaptability. The command accuracy was consistently on human administrators. The agent’s ability to learn from its mistakes
high, with the agent generating relevant and syntactically correct Linux and refine its strategies over time makes it a valuable tool for managing
commands and scripts in the vast majority of cases. complex server environments.

10
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

Table 7
Container tasks.
Basic Docker Operations
1.1 List all running Docker containers using the Docker CLI command. ✓
1.2 Pull a specific image (e.g., nginx) from Docker Hub using a Docker command. ✓
1.3 Start a Docker container from a pulled image with specified port mapping. ✗✓
Docker Container Management
2.1 Stop a running Docker container using its container ID or name. ✓
2.2 Remove a stopped Docker container from the system. ✓
2.3 Inspect a Docker container to retrieve its IP address. ✓
Docker Image Management
3.1 List all Docker images currently available on the system. ✓
3.2 Build a Docker image from a Dockerfile located in the current directory. ✓
3.3 Tag a local Docker image for pushing to a registry. ✓
Docker Network and Volume Management
4.1 List all Docker networks and identify the network a container is connected to. ✓
4.2 Create a new Docker volume for persistent data storage. ✗✗✓
4.3 Attach a volume to a Docker container for data persistence. ✓
Docker Monitoring and Troubleshooting
5.1 Display real-time logs of a Docker container. ✓
5.2 Display the last 100 lines of a container’s logs. ✗✓
5.3 Check the CPU and memory usage of a running Docker container. ✓

Table 8
Software management tasks.
Software Package Installation
1.1 Install a specific software package using the package manager. ✓
1.2 Install a software package with a specified version using the package manager. ✓
1.3 Verify the installation of a package using the package manager query command. ✓
Package Dependency Management
2.1 List missing dependencies for a specific package using package manager commands. ✓
2.2 Update all packages while checking for dependency issues using the package manager. ✓
2.3 Script to list and report packages with unresolved dependencies. ✓
Software Version Control
3.1 Update a specific software package to the latest version using package manager commands. ✓
3.2 Downgrade a software package to a specific version using package manager commands. ✓
3.3 Generate a list of outdated software packages using package manager commands. ✓
Software Configuration and Optimization
4.1 Apply a basic configuration through command line to a software package. ✓
4.2 Script to apply a basic configuration to a package. ✓
4.3 Read the documentation of a software and use its commands adaptively. ✗✗✗✓
Security and Compliance in Software Management
5.1 Check installed software against a list of known vulnerabilities. ✗✓
5.2 Apply security patches to a specific software package using package manager commands. ✓
5.3 Audit the licenses of software packages by reading their documentation. ✓

6. Discussions is entirely possible that superior open-source options or models from

other providers are now available or will emerge in the future.
In this study, we chose GPT-4 as the LLM for our empirical study Looking ahead, we plan to investigate the possibility of fine-tuning
into automating Linux server management tasks. While GPT-4 offers opensource models to achieve similar goals as the experiments carried
impressive capabilities for this purpose, we want to emphasize that out in this paper. We believe that the open-source community will play
it was selected for the following key reasons, rather than being the a crucial role in the development of LLMs for server management tasks,
definitively best model for the task: and we are excited to investigate the potential of these models in the
First, a crucial factor in our choice was the accessibility of GPT-4’s future.
interfaces. It offers an OpenAI 3.0 interface that seamlessly integrates
with our Node.js-based server using the JSON format. This enabled
7. Conclusions
us to establish a streamlined communication channel between the AI
agent and the Linux sandbox environment, facilitating efficient task
execution. Our study presents a groundbreaking approach to server manage-
Second, we acknowledge that open-source LLMs, like LLaMA, and ment, leveraging the capabilities of Large Language Model (LLM)-based
others, exist. However, at the time of our study, these models had not AI agents to simplify and automate complex tasks in Linux environ-
provided the same level of interface compatibility. This compatibility ments. This research integrates a GPT-based AI agent with a Node.js
was essential for our framework to function effectively, allowing the AI server backend and a Dockerized Linux sandbox environment, creating
agent to interact with the Linux environment. a framework that allows for the effective interpretation and execution
Finally, it is also worth noting that recent models like Google’s Gem- of server management tasks. This setup is particularly beneficial for
ini also did not provide the necessary functions when we conducted our users with varying levels of technical expertise, significantly lowering
research. The field of LLMs and their interfaces is rapidly evolving. It the barrier for managing complex server systems.

11
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

Table 9
Programming and scripting tasks.
C/C++ Development
1.1 Compile a single file C/C++ program and verify successful compilation. ✓
1.2 Compile a multifile C/C++ program and verify successful compilation. ✓
1.3 Use a Makefile to build a multi-file C/C++ project and verify successful build. ✓
Python Development
2.1 Run a Python script and verify its successful execution. ✓
2.2 Perform static code analysis on a Python script and report any issues. ✓
2.3 Identify the organization of a Python project and infer its functionalities. ✓
Java Development
3.1 Compile a Java program and verify the creation of the .class file. ✓
3.2 Run a Java application and check for successful execution. ✓
3.3 Use Maven or Gradle to build a Java project and verify successful build. ✓
LaTeX Document Management
4.1 Compile a LaTeX document and verify the creation of the PDF. ✓
4.2 Write a makefile to automate the compilation of multiple LaTeX documents. ✗✓
4.3 Check for syntax errors in a LaTeX document. ✓
Debugging
5.1 Modify a C program with an injected variable bug to successfully compile it. ✓
5.2 Understand the errors reported by Makefile for a C/C++ project and fix the software bug for rebuild. ✗✓
5.3 Install the required dependencies for a large project based on the reported errors. ✓

Through a series of comprehensive experiments encompassing 150 Acknowledgements

unique tasks across 9 categories, we have demonstrated the AI agent’s
ability to autonomously interpret task instructions, generate appro- This work was supported in part by the National Institute of Food
priate Linux commands or scripts, and adapt its strategies based on and Agriculture/USDA under Grant 2023-67021-40613, and by the AI
real-time feedback. This not only demonstrates the agent’s proficiency Tennessee Initiative Seed Fund. Any opinions, findings, and conclusions
in handling a wide spectrum of server management tasks but also high- or recommendations expressed in this material are those of the authors
lights its potential for reducing human error and enhancing operational and do not necessarily reflect the views of the funding agencies.
efficiency.
The implications of our research are profound, offering a new References
paradigm in server management where AI-driven automation becomes
Adelstein, T., & Lubanovic, B. (2007). Linux System Administration. " O’Reilly Media,
a key facilitator. This framework not only streamlines server manage- Inc.".
ment processes but also opens up possibilities for its application in Barr, E. T., Harman, M., McMinn, P., Shahbaz, M., & Yoo, S. (2014). The oracle
other areas requiring complex data interpretation and task execution. problem in software testing: A survey. IEEE Transactions on Software Engineering,
41(5), 507–525.
Essentially, our research paves the way for more efficient, accessible,
Bertino, E., Kantarcioglu, M., Akcora, C. G., Samtani, S., Mittal, S., & Gupta, M. (2021).
and intelligent server management, marking a significant step forward AI for security and security for AI. In Proceedings of the eleventh ACM conference on
in the integration of AI in technology-driven operational environments. data and application security and privacy (pp. 333–334).
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020).
Language models are few-shot learners. Advances in Neural Information Processing
CRediT authorship contribution statement Systems, 33, 1877–1901.
Chen, Y., Li, R., Zhao, Z., Peng, C., Wu, J., Hossain, E., et al. (2023). Netgpt: A native-ai
network architecture beyond provisioning personalized generative services. arXiv
Charles Cao: Conceived and designed the analysis, Collected the preprint arXiv:2307.06148.
data, Contributed data or analysis tools, Performed the analysis, Wrote Chen, R., Pu, Y., Shi, B., & Wu, W. (2023). An automatic model management system and
its implementation for AIOps on microservice platforms. Journal of Supercomputing,
the paper. Feiyi Wang: Conceived and designed the analysis, Con- 79(10), 11410–11426.
tributed data or analysis tools, Wrote the paper. Lisa Lindley: Con- Costa, M., Oliveira, D., Pinto, S., & Tavares, A. (2019). Detecting driver’s fatigue,
tributed data or analysis tools, Wrote the paper, Discussions on paper distraction and activity using a non-intrusive ai-based monitoring system. Journal
development. Zejiang Wang: Contributed data or analysis tools, Wrote of Artificial Intelligence and Soft Computing Research, 9(4), 247–266.
Drath, R. (2021). AutomationML: The Industrial Cookbook. Walter de Gruyter GmbH &
the paper, Discussions for paper development. Co KG.
Fan, W., Zhao, Z., Li, J., Liu, Y., Mei, X., Wang, Y., et al. (2023). Recommender systems
in the era of large language models (llms). arXiv preprint arXiv:2307.02046.
Declaration of competing interest Fox, R. (2021). Linux with operating system concepts. CRC Press.
Fox, R., & Hao, W. (2017). Internet infrastructure: Networking, web services, and cloud
computing. CRC Press.
The authors declare the following financial interests/personal rela- Fui-Hoon Nah, F., Zheng, R., Cai, J., Siau, K., & Chen, L. (2023). Generative AI
tionships which may be considered as potential competing interests: and ChatGPT: Applications, challenges, and AI-human collaboration. Journal of
Charles Cao reports financial support was provided by US Department Information Technology Case and Application Research, 25(3), 277–304.
Gao, D. W., Wang, Q., Zhang, F., Yang, X., Huang, Z., Ma, S., et al. (2019). Application
of Agriculture. If there are other authors, they declare that they have
of AI techniques in monitoring and operation of power systems. Front Energy, 13,
no known competing financial interests or personal relationships that 71–85.
could have appeared to influence the work reported in this paper. Greamo, C., & Ghosh, A. (2011). Sandboxing and virtualization: Modern tools for
combating malware. IEEE Security Privacy, 9(2), 79–82.
Himeur, Y., Elnour, M., Fadli, F., Meskin, N., Petri, I., Rezgui, Y., et al. (2023). AI-big
Data availability data analytics for building automation and management systems: A survey, actual
challenges and future perspectives. Artificial Intelligence Review, 56(6), 4929–5021.
Hu, Y., Kuang, W., Qin, Z., Li, K., Zhang, J., Gao, Y., et al. (2021). Artificial intelligence
Data will be made available on request. security: Threats and countermeasures. ACM Computing Surveys, 55(1), 1–36.

12
C. Cao et al. Machine Learning with Applications 17 (2024) 100570

Jiang, Y., Schoop, E., Swearngin, A., & Nichols, J. (2023). Iluvui: Instruction-tuned Prana, G. A. A., Treude, C., Thung, F., Atapattu, T., & Lo, D. (2019). Categorizing the
LangUage-vision modeling of UIs from machine conversations. arXiv preprint arXiv: content of github readme files. Empirical Software Engineering, 24, 1296–1327.
2310.04869. Roumeliotis, K. I., & Tselikas, N. D. (2023). ChatGPT and open-AI models: A preliminary
Juristo, N., & Moreno, A. M. (2013). Basics of software engineering experimentation. review. Future Internet, 15(6), 192.
Springer Science & Business Media. Rubin, O., Herzig, J., & Berant, J. (2021). Learning to retrieve prompts for in-context
Li, H., Hao, Y., Zhai, Y., & Qian, Z. (2023). Assisting static analysis with large learning. arXiv preprint arXiv:2112.08633.
language models: A ChatGPT experiment. In Proceedings of the 31st ACM joint Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
European software engineering conference and symposium on the foundations of software Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., et al. (2023).
engineering (pp. 2107–2111). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:
Liu, V., & Chilton, L. B. (2022). Design guidelines for prompt engineering text-to-image 2307.09288.
generative models. In Proceedings of the 2022 CHI conference on human factors in Ward, B. (2021). How Linux works: What every superuser should know. no starch Press.
computing systems (pp. 1–23). Wen, H., Li, Y., Liu, G., Zhao, S., Yu, T., Li, T. J.-J., et al. (2023). Empowering llm to
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., et al. (2023). AgentBench: Evaluating use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272.
LLMs as agents. arXiv:2308.03688. Yan, A., Yang, Z., Zhu, W., Lin, K., Li, L., Wang, J., et al. (2023). Gpt-4v in wonderland:
Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., & Neubig, G. (2023). Pre-train, Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint
prompt, and predict: A systematic survey of prompting methods in natural language arXiv:2311.07562.
processing. ACM Computing Surveys, 55(9), 1–35. Ye, S., Hwang, H., Yang, S., Yun, H., Kim, Y., & Seo, M. (2023). In-context instruction
Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., et al. (2022). learning. arXiv preprint arXiv:2302.14691.
Rethinking the role of demonstrations: What makes in-context learning work? arXiv Yuan, J., Yang, C., Cai, D., Wang, S., Yuan, X., Zhang, Z., et al. (2023). Rethinking
preprint arXiv:2202.12837. mobile AI ecosystem in the LLM era. arXiv preprint arXiv:2308.14363.
Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., et al. (2023). Zhan, Z., & Zhang, A. (2023). You only look at screens: Multimodal chain-of-action
A comprehensive overview of large language models. arXiv preprint arXiv:2307. agents. arXiv preprint arXiv:2309.11436.
06435. Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., et al. (2022).
Oliner, A., Ganapathi, A., & Xu, W. (2012). Advances and challenges in log analysis. Large language models are human-level prompt engineers. arXiv preprint arXiv:
Communications of the ACM, 55(2), 55–61. 2211.01910.