learn-by-interaction-advancing-agentic-ai-for-web-automation-with-lang-graph
1 Introduction
With the rapid development of large language model technology, Web agents,
one of the key technologies for automated Web interaction, are becoming an
active research topic. Web agents can simulate the behavior of human users,
automatically browse web pages, execute tasks, and obtain information, which
is of great significance for improving information retrieval efficiency,
automated testing, and user experience. However, building an efficient and
stable Web agent faces numerous challenges, including page navigation, visual
positioning, and hallucination handling. This research aims to design and
implement a LangGraph-based Web agent, driven by the multimodal large language
model GPT-4o, and to realize multiple Web page operation tools through the
automated Web browsing environment Playwright.
The methodological design of this research includes the construction of the
LangGraph system architecture, the implementation of the Web browsing
environment, and the building of the Web agent. First, an automated Web
browsing environment is set up with Playwright to ensure that the agent can
simulate real-user operations and is compatible with multiple browsers and
platforms. Second, the LangGraph-based agent is driven by GPT-4o and is
constructed from several key components: page marking, prompt templates, the
GPT-4o model, and functions for parsing the output. In addition, the agent is
equipped with six basic tools that simulate common human web-browsing
operations to build an action space, enabling it to efficiently complete
web-interaction tasks.
https://doi.org/10.33774/coe-2025-b0gbv Content not peer-reviewed by Cambridge University Press. License: All Rights Reserved
This research demonstrates successful cases of the agent in Web interaction
and also reveals the challenges that the agent faces, such as page navigation
and hallucination handling. Future research will focus on further optimizing
the agent to improve its stability and execution efficiency in the Web
environment.
2 Related Work
Wang et al. use executable code as the action format of the large language
model (LLM) agent [1]. UI-Hawk is a visual GUI agent that can process screens
and utilize historical screen information [2]. CodeGemma is a dedicated
open-source code model built on Gemma, capable of performing various code and
natural language generation tasks [3].
The ReAct method improves the model's performance in language understanding
and decision-making tasks by integrating reasoning and action capabilities in
the language model [4]. WebArena is a realistic web environment for building
and evaluating autonomous agents that execute natural language commands in
real-world tasks [5].
Language Agent Tree Search (LATS) is a general framework that integrates the
capabilities of LLMs in planning, execution, and reasoning [6]. Reflexion
endows an autonomous agent with dynamic memory and self-reflection abilities,
aiming to enhance its reasoning and task-specific operation capabilities [7].
Unlike the above methods, this study constructs a Web agent based on the
LangGraph framework. LangGraph is an agent-building framework based on a
directed graph topology. By defining nodes and edges, it can flexibly control
the task flow and dynamically update the state at runtime, making it suitable
for building agents for complex tasks.
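The node-and-edge idea can be illustrated with a minimal plain-Python sketch. It is not the LangGraph API: the class `TinyGraph`, the `END` sentinel, and the three node functions are hypothetical names chosen for illustration, and the "decide" node is a stand-in for the language model. The sketch only shows how nodes return partial state updates and how a conditional edge picks the next node at runtime.

```python
from typing import Callable, Dict

State = Dict[str, object]
END = "__end__"  # hypothetical sentinel marking the end of the run


class TinyGraph:
    """A toy directed graph of state-updating nodes (not the LangGraph API)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Callable[[State], State]] = {}
        self.routers: Dict[str, Callable[[State], str]] = {}

    def add_node(self, name: str, fn: Callable[[State], State]) -> None:
        self.nodes[name] = fn

    def add_edge(self, src: str, dst: str) -> None:
        # A fixed edge is a router that always returns the same target.
        self.routers[src] = lambda _state: dst

    def add_conditional_edge(self, src: str, router: Callable[[State], str]) -> None:
        self.routers[src] = router

    def run(self, entry: str, state: State, max_steps: int = 50) -> State:
        current = entry
        for _ in range(max_steps):
            if current == END:
                return state
            # Each node returns a partial update that is merged into the state.
            state = {**state, **self.nodes[current](state)}
            current = self.routers[current](state)
        raise RuntimeError("step limit reached before END")


def observe(state: State) -> State:
    return {"observations": state["observations"] + 1}


def decide(state: State) -> State:
    # Stand-in for the model: declare the task done after three observations.
    return {"done": state["observations"] >= 3}


def act(state: State) -> State:
    return {"actions": state["actions"] + 1}


graph = TinyGraph()
graph.add_node("observe", observe)
graph.add_node("decide", decide)
graph.add_node("act", act)
graph.add_edge("observe", "decide")
graph.add_conditional_edge("decide", lambda s: END if s["done"] else "act")
graph.add_edge("act", "observe")

final = graph.run("observe", {"observations": 0, "actions": 0, "done": False})
print(final)
```

The conditional edge after "decide" is what lets the loop end when the state says the task is complete, rather than after a fixed number of steps.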
Fig. 1. The workflow of the LangGraph-based Web agent is as follows. First, the user
poses a question. The Web agent then receives the task and automatically operates
the web page using the Playwright tool. The agent takes a screenshot of the web page
and marks the text information on it. After thinking, the large language model decides
the next operation, and the agent executes the corresponding tool operations, such
as clicking, inputting text, etc. This process iterates in a loop until the final result is
obtained. Finally, the Web agent returns the processed answer to the user.
languages (TypeScript, JavaScript, Python, .NET, Java). It can also test mobile
web applications. Playwright generates events through browser input to ensure
that the testing behavior is consistent with the actual operations of users.
Go to Search Engine (Bing): When the agent has difficulty obtaining the
required answer on the current website, it can navigate to the Bing search page
and start information retrieval again.
Through these tools that simulate human mouse and keyboard operations,
the agent can browse various web pages, locate the content needed to complete
the task, respond in a concise operation format, accurately locate the elements
that need to be interacted with, and execute the corresponding actions.
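As an illustration of a concise operation format, the sketch below parses one line of model output into a structured tool call. The reply formats (`Click [45]`, `Type [19]; text`, `Bing`) and the names `Action` and `parse_action` are assumptions made for this example; the exact format used by the agent is not specified here, and only three of the tools are covered.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    tool: str                      # "click", "type", or "bing"
    element: Optional[int] = None  # numeric label of the marked page element
    text: Optional[str] = None     # text to enter, used by "type" only


# Hypothetical reply formats: "Click [45]", "Type [19]; baked salmon", "Bing".
_PATTERNS = [
    ("click", re.compile(r"^click\s*\[(\d+)\]$", re.IGNORECASE)),
    ("type", re.compile(r"^type\s*\[(\d+)\]\s*;\s*(.+)$", re.IGNORECASE)),
    ("bing", re.compile(r"^bing$", re.IGNORECASE)),
]


def parse_action(reply: str) -> Action:
    """Turn one line of model output into an Action, or raise ValueError."""
    line = reply.strip()
    for tool, pattern in _PATTERNS:
        match = pattern.match(line)
        if not match:
            continue
        groups = match.groups()
        element = int(groups[0]) if groups else None
        text = groups[1].strip() if len(groups) > 1 else None
        return Action(tool, element, text)
    # Rejecting unknown replies keeps a hallucinated tool from being executed.
    raise ValueError(f"unrecognized action: {reply!r}")


print(parse_action("Type [19]; baked salmon"))
print(parse_action("Click [45]"))
```

Raising an error on an unrecognized reply gives the calling loop a chance to re-prompt the model rather than act on malformed output.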
Fig. 2. Workflow Based on LangGraph.
– Instruction: Find a recipe for Baked Salmon that takes less than 30 minutes
to prepare and has at least a 4-star rating based on user reviews.
– Action: Perform click operations in sequence, type the instruction in a
specific element [19], and then click on elements numbered [45], [40], [42], etc.
– Final response: Best Ever Baked Salmon from Kristine's Kitchen Blog
takes 30 minutes to prepare and has at least a 4-star rating based on user
reviews.
– Image: A screenshot of the page with multiple labeled boxes. Each marked
element is assigned a randomly colored border and a numbered label; the page
shows various Baked Salmon recipes.
5 Discussion
This section discusses the challenges faced by large language model agents in
completing tasks. When executing tasks and interacting with Web pages, these
agents may fail in several ways: imprecise navigation queries; inability to
interpret uncommon patterns during image analysis; hallucinations that ignore
task requirements and an inaccurate understanding of context, leading to
actions that do not align with the goals; and inconsistent adherence to the
prompt, in which the agent fails to follow instructions.
Fig. 3. Baked Salmon Recipe Search based on agents.
Fig. 4. Shanghai Weather Query based on agents.
5.1 Improvement
– Use LangGraph’s Persistence Layer
• Store conversation history, agent state, and tool usage persistently.
• Support long-term user memory, allowing the agent to recall past
interactions.
– Enable Human-in-the-Loop (HITL)
• Introduce checkpointed execution, where human oversight can pause,
correct, and resume agent workflows when needed.
– Implement Multi-Session Memory
• The agent can remember past failed attempts and adjust behavior
accordingly.
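The persistence and human-in-the-loop ideas above can be sketched in plain Python. This is not LangGraph's persistence layer: the JSON file checkpoint, the function names, and the `approved` flag are assumptions made for illustration. The sketch saves state after every step, pauses before a step flagged for review, and resumes from the saved checkpoint once a human approves.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


def save_checkpoint(path: Path, state: State) -> None:
    path.write_text(json.dumps(state))


def load_checkpoint(path: Path) -> Optional[State]:
    return json.loads(path.read_text()) if path.exists() else None


def run_with_review(
    steps: List[str],
    path: Path,
    needs_review: Callable[[str], bool],
    approved: bool = False,
) -> State:
    """Run steps in order and persist progress; pause before a flagged step
    unless a human has approved it."""
    state = load_checkpoint(path) or {"next": 0, "history": [], "paused": False}
    while state["next"] < len(steps):
        step = steps[state["next"]]
        if needs_review(step):
            if not approved:
                state["paused"] = True
                save_checkpoint(path, state)
                return state      # hand control to the human reviewer
            approved = False      # one approval covers one flagged step
        state["history"].append(step)  # stand-in for executing the tool
        state["next"] += 1
        state["paused"] = False
        save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "agent_state.json"  # hypothetical checkpoint file
    plan = ["open page", "type query", "submit order", "read result"]

    def risky(step: str) -> bool:
        return step.startswith("submit")

    first = run_with_review(plan, ckpt, risky)                   # pauses at "submit order"
    resumed = run_with_review(plan, ckpt, risky, approved=True)  # continues from checkpoint

print(first["paused"], resumed["history"])
```

Because the history is stored in the checkpoint, the same file could also carry records of failed attempts across sessions, which is the basis of the multi-session memory item.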
6 Conclusion
This research implements a LangGraph-based Web agent driven by the multimodal
large language model GPT-4o. Relying on an automated Web browsing environment,
multiple Web page operation tools have been developed. Through a series of
experiments, this research demonstrates successful application cases of the
agent in Web interactions and explores the challenges the agent faces in page
navigation and hallucination handling. In the future, further systematic
optimization is needed to improve the stability and execution efficiency of
the agent in the Web environment.
References
1. Xingyao Wang, Yangyi Chen, et al. Executable code actions elicit better LLM agents.
CoRR, abs/2402.01030, 2024.
2. Jiwen Zhang, Yaqi Yu, et al. UI-Hawk: Unleashing the screen stream understanding
for GUI agents. crossref, 2024.
3. CodeGemma Team, Heri Zhao, et al. CodeGemma: Open code models based on
Gemma. CoRR, abs/2406.11409, 2024.
4. Shunyu Yao, Jeffrey Zhao, et al. ReAct: Synergizing reasoning and acting in language
models. CoRR, 2022.
5. Shuyan Zhou, Frank F. Xu, et al. WebArena: A realistic web environment for building
autonomous agents. CoRR, abs/2307.13854, 2023.
6. Andy Zhou, Kai Yan, et al. Language agent tree search unifies reasoning, acting, and
planning in language models. CoRR, abs/2310.04406, 2024.
7. Noah Shinn, Federico Cassano, et al. Reflexion: Language agents with verbal
reinforcement learning. CoRR, 2023.
8. Hongliang He, Wenlin Yao, et al. WebVoyager: Building an end-to-end web agent
with large multimodal models. Proceedings of the 62nd Annual Meeting of the
Association for Computational Linguistics, pages 6864–6890, 2024.
9. Hongjin Su, Ruoxi Sun, et al. Learn-by-interact: A data-centric framework for
self-adaptive agents in realistic environments. CoRR, 2025.
10. Boyuan Zheng, Boyu Gou, et al. GPT-4V(ision) is a generalist web agent, if grounded.
ICML, 2024.