Learn by Interaction: Advancing Agentic AI for Web Automation with LangGraph

Jialin Wang (1) and Zhihua Duan (2)

(1) Executive Vice President, Ferret Relationship Intelligence
Burlingame, CA 94010, USA
[email protected]
https://www.linkedin.com/in/starspacenlp/
(2) Intelligent Cloud Network Monitoring Department
China Telecom Shanghai Company, Shanghai, China
[email protected]

Abstract. With the rapid development of large language model technology,
Web agents, a key technology for automated Web interaction, have become a
research hotspot. This study designs and implements a LangGraph-based Web
agent. The agent is driven by the multimodal large language model GPT-4o and
operates in an automated Web browsing environment built on Playwright,
through which multiple Web page operation tools are implemented. The study
demonstrates successful cases of the agent in Web interaction and also
reveals its challenges in areas such as page navigation and hallucination
handling. Future research will focus on optimizing the agent to improve its
stability and execution efficiency in the Web environment.

Keywords: Large Language Model · Web Agent · LangChain · LangGraph · GPT-4o · Hallucination.

1 Introduction

With the rapid development of large language model technology, Web agents,
one of the key technologies for automated Web interaction, are becoming a hot
research topic. Web agents can simulate the behavior of human users by
automatically browsing web pages, executing tasks, and obtaining information,
which is valuable for improving information retrieval efficiency, automating
testing, and enhancing user experience. However, building an efficient and
stable Web agent faces numerous challenges, including page navigation, visual
positioning, and hallucination handling. This research aims to design and
implement a LangGraph-based Web agent, driven by the multimodal large
language model GPT-4o, and to implement multiple Web page operation tools
through the automated Web browsing environment Playwright.
The methodological design of this research includes the construction of the
LangGraph system architecture, the implementation of the Web browsing
environment, and the building of the Web agent. First, an automated Web
browsing environment is set up with Playwright so that the agent can simulate
real-user operations and remain compatible with multiple browsers and
platforms. Second, the LangGraph-based large language model agent is driven
by GPT-4o and is constructed from several key steps: page marking, a prompt
template, the GPT-4o call, and a function for parsing the output. In
addition, the agent is equipped with six basic tools that simulate common
human web-browsing operations. These tools form its action space and enable
it to complete web-interaction tasks efficiently.

https://doi.org/10.33774/coe-2025-b0gbv Content not peer-reviewed by Cambridge University Press. License: All Rights Reserved

This research demonstrates successful cases of the agent in Web interaction
and also reveals the challenges the agent faces, such as page navigation and
hallucination handling. Future research will focus on further optimizing the
agent to improve its stability and execution efficiency in the Web
environment.

2 Related Work

2.1 Large Language Model based Agents

Wang et al. use executable code as the action format of large language model
(LLM) agents [1]. UI-Hawk is a visual GUI agent that can process screens and
utilize historical screen information [2]. CodeGemma is a dedicated
open-source code model built on Gemma, capable of performing various code and
natural language generation tasks [3].
The ReAct method improves model performance in language understanding and
decision-making tasks by integrating reasoning and acting in the language
model [4]. WebArena is an environment for building and evaluating autonomous
agents that execute natural language commands in realistic tasks [5].

2.2 Reinforcement Learning based Agents

Language Agent Tree Search (LATS) is a general framework that integrates the
capabilities of LLMs in planning, acting, and reasoning [6]. Reflexion endows
an autonomous agent with dynamic memory and self-reflection abilities, aiming
to enhance its reasoning and task-specific operation capabilities [7].

2.3 Web based Agent

WebVoyager is an end-to-end web agent based on large multimodal models (LMMs)
[8]. The Learn-by-interact framework generates trajectories of interaction
between an agent and its environment, grounded in documentation, to improve
the agent's task performance in real-world environments [9]. The potential of
GPT-4V(ision) as a general-purpose web agent is explored in [10], which
proposes the SeeAct framework to integrate visual understanding and web
action capabilities.

Unlike the methods above, this study constructs a Web agent based on
the LangGraph framework. LangGraph is an agent building framework based on
directed graph topology. By defining nodes and edges, it can flexibly control the
task flow and dynamically update the state during runtime, making it suitable
for building agents for complex tasks.

3 Systematic Method Design


As shown in Figure 1, the methodology of this study focuses on constructing a
Web agent based on LangGraph. Playwright is used to build an automated Web
browsing environment; its broad browser compatibility and cross-platform
support allow it to simulate real user operations. The agent is driven by the
multimodal large language model GPT-4o. Multiple key steps are connected
through an instance of the RunnablePassthrough object: a function for marking
pages, a prompt template, GPT-4o, and a function for parsing the output. The
agent is also equipped with six basic tools that simulate common operations
humans perform when browsing web pages. Together these tools form an action
space that enables the agent to complete web interaction tasks efficiently.

Fig. 1. The workflow of the LangGraph-based Web agent. First, the user
poses a question. The Web agent then receives the task and automatically operates
the web page using the Playwright tool. The agent takes a screenshot of the web page
and marks the text information on it. After thinking, the large language model decides
the next operation, and the agent executes the corresponding tool operations, such
as clicking, inputting text, etc. This process iterates in a loop until the final result is
obtained. Finally, the Web agent returns the processed answer to the user.

3.1 Web browsing environment


In this study, Playwright is used to implement an automated web browsing
environment. Playwright is a powerful end-to-end testing tool. It supports
all modern rendering engines (Chromium, WebKit, and Firefox), is
cross-platform (compatible with Windows, Linux, and macOS), and supports
multiple programming languages (TypeScript, JavaScript, Python, .NET, Java).
It can also test mobile web applications. Playwright generates events through
the browser's own input pipeline, so that automated behavior is consistent
with the actual operations of users.

3.2 Construction of Web Agents


The LangGraph-based large model agent is driven by the multimodal large
language model GPT-4o. Starting from an instance of the RunnablePassthrough
object, multiple steps are connected through the pipe operator (|). The
pipeline is mainly composed of the following runnable objects:
1. Define an asynchronous function mark_page that takes a page object as
a parameter and returns a dictionary containing the page screenshot and
bounding-box information.
2. Design a prompt template to guide a robot that simulates human web
browsing. The template includes system messages, placeholder messages, and
human messages. The system messages describe the task and the operation
guidelines; the placeholder messages hold thinking information and
operation records; the human messages provide the web page screenshot, the
bounding-box descriptions, and the task description entered by the user.
The template also defines input variables, optional variables, input
types, partial variables, metadata, and message templates for configuring
and managing the use of the prompt template.
3. Use the GPT-4o large language model to decide the next operation. GPT-4o
is a multimodal language model launched by OpenAI. It can process image
content, including functions such as image understanding, description,
question answering, image generation based on text, style transfer, etc.,
and also supports the integrated interaction of images, text, and voice.
4. Define a parsing function parse that parses the text output generated by
the large language model, extracts the action and parameter information
from it, and converts it into a dictionary format.
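
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: chain is a hypothetical stand-in for the pipe operator, mark_page_stub and fake_llm replace the real screenshot function and the GPT-4o call, and the "Thought / Action" reply format and the "retry" fallback action are assumed for the example.

```python
import re
from typing import Any, Callable, Dict

State = Dict[str, Any]


def chain(*steps: Callable[[State], State]) -> Callable[[State], State]:
    """Compose steps left to right, mimicking a pipe operator."""
    def run(state: State) -> State:
        for step in steps:
            state = step(state)
        return state
    return run


def mark_page_stub(state: State) -> State:
    # Stand-in for mark_page: a real version would take a screenshot and
    # collect the bounding boxes of interactive elements.
    bboxes = [{"label": 0, "text": "Search"}, {"label": 1, "text": "Sign in"}]
    return {**state, "img": "<screenshot placeholder>", "bboxes": bboxes}


def format_prompt(state: State) -> State:
    # Describe each bounding box in text so the model can refer to its label.
    boxes = "\n".join(f'[{b["label"]}] "{b["text"]}"' for b in state["bboxes"])
    prompt = f"Task: {state['input']}\nValid bounding boxes:\n{boxes}"
    return {**state, "prompt": prompt}


def fake_llm(state: State) -> State:
    # Stand-in for the GPT-4o call; always returns the same canned reply.
    reply = "Thought: use the search box.\nAction: Type [0]; baked salmon"
    return {**state, "reply": reply}


# Assumed reply format: the last line reads "Action: <Name> <arg>; <arg>".
ACTION_RE = re.compile(r"^Action:\s*(?P<name>\w+)\s*(?P<args>.*)$")


def parse(text: str) -> Dict[str, Any]:
    """Extract the action name and its arguments from the last reply line."""
    last_line = text.strip().split("\n")[-1]
    match = ACTION_RE.match(last_line)
    if match is None:
        # Hypothetical fallback so the caller can ask the model again.
        return {"action": "retry", "args": [f"Could not parse: {text}"]}
    raw = match.group("args").strip()
    args = [part.strip().strip("[]") for part in raw.split(";")] if raw else []
    return {"action": match.group("name"), "args": args}


def parse_step(state: State) -> State:
    return {**state, "prediction": parse(state["reply"])}


agent = chain(mark_page_stub, format_prompt, fake_llm, parse_step)
result = agent({"input": "Find a baked salmon recipe"})
print(result["prediction"])
```

Each step receives the accumulated state and returns an extended copy, so the parsed action remains available next to the prompt and bounding boxes that produced it.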

3.3 Construct the Action Space


The LangGraph-based large model agent is equipped with six basic tools that
simulate common operations humans perform when browsing web pages. These
tools form the agent's action space:
Click: Perform a click operation on the bounding box marked with a numer-
ical label in the screenshot. This is usually used to click elements such as links
and buttons on the web page.
Input: Enter the specified text into the text box.
Scroll: Execute a scrolling operation on the web page according to the given
scrolling parameters.
Wait: Set a waiting time, which is used to wait for the web page to finish
loading, for page elements to appear, or for other situations that require waiting.
Back: Execute the backward operation in the web browser. This is often used
to simulate the user’s navigation behavior on the web page, such as returning to
the previously browsed page.

Go to Search Engine (Bing): When the agent has difficulty obtaining the
required answer on the current website, it can navigate to the Bing search page
and start information retrieval again.
With these tools, which simulate human mouse and keyboard operations, the
agent can browse various web pages and locate the content needed to complete
the task. It responds in a concise operation format, accurately identifies
the elements it needs to interact with, and executes the corresponding
actions.
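
One way to organize such an action space is a dispatch table from action names to tool functions. The sketch below is illustrative only: FakePage records calls instead of driving a real browser, the bounding boxes are assumed to store the center coordinates of each element, and the scroll distance and default wait time are arbitrary assumptions. A label that does not match any bounding box (for example, a hallucinated one) produces an error message rather than an exception.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

BBox = Dict[str, float]


class FakePage:
    """Records browser calls instead of performing them (test stand-in)."""

    def __init__(self) -> None:
        self.calls: List[Tuple[Any, ...]] = []

    def record(self, *call: Any) -> None:
        self.calls.append(call)


def find_box(bboxes: List[BBox], label: str) -> Optional[BBox]:
    """Return the box for a numeric label, or None if the label is invalid."""
    if label.isdecimal() and int(label) < len(bboxes):
        return bboxes[int(label)]
    return None


def click(page: FakePage, bboxes: List[BBox], args: List[str]) -> str:
    box = find_box(bboxes, args[0]) if len(args) == 1 else None
    if box is None:
        return f"Failed to click: no valid bounding box in {args}"
    page.record("click", box["x"], box["y"])
    return f"Clicked {args[0]}"


def type_text(page: FakePage, bboxes: List[BBox], args: List[str]) -> str:
    box = find_box(bboxes, args[0]) if len(args) == 2 else None
    if box is None:
        return f"Failed to type: no valid bounding box in {args}"
    page.record("click", box["x"], box["y"])  # focus the text box first
    page.record("type", args[1])
    page.record("press", "Enter")
    return f"Typed {args[1]!r} into {args[0]}"


def scroll(page: FakePage, bboxes: List[BBox], args: List[str]) -> str:
    if args not in (["up"], ["down"]):
        return f"Failed to scroll: expected 'up' or 'down', got {args}"
    page.record("scroll", 500 if args[0] == "down" else -500)
    return f"Scrolled {args[0]}"


def wait(page: FakePage, bboxes: List[BBox], args: List[str]) -> str:
    page.record("wait", 2)  # assumed default of 2 seconds
    return "Waited for 2 s"


def go_back(page: FakePage, bboxes: List[BBox], args: List[str]) -> str:
    page.record("go_back")
    return "Navigated back"


def to_bing(page: FakePage, bboxes: List[BBox], args: List[str]) -> str:
    page.record("goto", "https://www.bing.com/")
    return "Navigated to bing.com"


Tool = Callable[[FakePage, List[BBox], List[str]], str]
TOOLS: Dict[str, Tool] = {
    "Click": click, "Type": type_text, "Scroll": scroll,
    "Wait": wait, "GoBack": go_back, "Bing": to_bing,
}


def execute(page: FakePage, bboxes: List[BBox], prediction: Dict[str, Any]) -> str:
    """Run the tool named in the prediction and return an observation string."""
    tool = TOOLS.get(prediction["action"])
    if tool is None:
        return f"Unknown action: {prediction['action']}"
    return tool(page, bboxes, prediction["args"])
```

Returning a text observation from every tool, including on failure, gives the agent something to record and react to in its next step.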

4 Experimental Design and Methods


4.1 Agent Workflow Based on LangGraph
Figure 2 presents the implementation flowchart of the LangGraph-based Web
agent, depicting the operational steps the agent takes when performing a
task.
Start: The process is initiated from the "start" node.
Web Agent: This is the core node of the process. The Web agent is a
pipeline of multiple steps: page marking, prompt design, large-model
invocation, and output parsing.
Operation Nodes: The Web Agent node leads to multiple operation nodes,
which represent the different actions the agent can execute:
– Click: The click operation.
– GoBack: Go back to the previous page.
– Bing: Access Bing.
– Scroll: Scroll the page.
– Type: Input text.
– Wait: The wait operation.
Revise Draft Area: This is an intermediate node. Every operation node points
to this node (the end node does not), so after each operation the flow enters
the draft area for an update. The draft area stores the intermediate results
and observations of the agent during task execution. It is updated after each
tool call, so the agent is aware of its previous steps and can track its own
execution steps and observation results.
End: The process finally reaches the "end" node, indicating the completion
of the task.
The entire flowchart shows the series of operations performed by the Web
agent during task execution. The agent updates the draft area at the revision
node and iterates through multiple rounds of this loop until the task is
accomplished.
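
The loop described above can be sketched without any framework. The code below is a plain-Python analogue of the graph, not LangGraph code: the "ANSWER" action that routes to the end node, the max_steps limit, and the scripted_agent helper are all assumptions made for the example.

```python
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]
Node = Callable[[State], State]
ToolFn = Callable[[State], str]


def update_scratchpad(state: State) -> State:
    """Draft-area node: append the latest observation as a numbered note."""
    notes: List[str] = state.get("scratchpad", [])
    entry = f"{len(notes) + 1}. {state['observation']}"
    return {**state, "scratchpad": notes + [entry]}


def run_workflow(agent: Node, tools: Dict[str, ToolFn], state: State,
                 max_steps: int = 10) -> State:
    """Alternate between the agent node and one operation node per round."""
    for _ in range(max_steps):
        state = agent(state)                      # Web Agent node
        action = state["prediction"]["action"]
        if action == "ANSWER":                    # route to the end node
            return {**state, "answer": state["prediction"]["args"][0]}
        tool: Optional[ToolFn] = tools.get(action)
        observation = tool(state) if tool else f"Unknown action: {action}"
        state = update_scratchpad({**state, "observation": observation})
    return {**state, "answer": None}              # step budget exhausted


def scripted_agent(script: List[Dict[str, Any]]) -> Node:
    """Hypothetical agent that replays fixed predictions, one per round."""
    def agent(state: State) -> State:
        step = len(state.get("scratchpad", []))
        return {**state, "prediction": script[step]}
    return agent


script = [
    {"action": "Type", "args": ["20", "weather in Shanghai"]},
    {"action": "ANSWER", "args": ["Rain in Shanghai"]},
]
tools = {"Type": lambda state: "Typed the query and submitted"}
final = run_workflow(scripted_agent(script), tools, {"input": "weather?"})
print(final["scratchpad"], final["answer"])
```

Because the draft area grows by one numbered entry per tool call, the agent can read its own history on the next round, and the step limit keeps a confused agent from looping indefinitely.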

4.2 Web Agent case study


As shown in Figure 3, the LangGraph-based Web agent searches for a Baked
Salmon recipe that meets specific criteria using a web-based interface. This
is an example of the agent's operation.

Fig. 2. Workflow Based on LangGraph.

– Instruction: Find a recipe for Baked Salmon that takes less than 30 minutes
to prepare and has at least a 4-star rating based on user reviews.
– Action: Type the instruction into element [19], and then click elements
[45], [40], [42], etc. in sequence.
– Final response: Best Ever Baked Salmon from Kristine's Kitchen Blog
takes 30 minutes to prepare and has at least a 4-star rating based on user
reviews.
– Image: A screenshot with multiple labeled boxes on the page. Each marked
element is given a randomly colored border and a numbered label. The page
shows various Baked Salmon recipes.

As shown in Figure 4, the LangGraph-based Web agent queries today's weather
in Shanghai using a web-based interface (Microsoft Bing in this case). This
is an example of the agent's operation.

– Instruction: What's the weather like in Shanghai today?
– Action: Type the instruction in element [20], i.e., ["20", "What's the
weather like in Shanghai today?"].
– Final response: It's currently 8 °C with rain in Shanghai and will continue
raining for at least the next 2 hours.
– Image: A screenshot with multiple labeled boxes on the page. Each marked
element is given a randomly colored border and a numbered label. The page
shows the weather-related query results on Microsoft Bing for Shanghai.

Fig. 3. Baked Salmon Recipe Search based on agents.

Fig. 4. Shanghai Weather Query based on agents.

5 Discussion
This section discusses the challenges faced by large language model agents in
completing tasks. These agents may fail while executing tasks and interacting
with Web pages. Typical failures include imprecise navigation queries;
inability to interpret uncommon patterns during image analysis;
hallucinations that ignore task requirements or misread the context, leading
to actions that do not align with the goal; and inconsistent adherence to the
instructions in the prompt.

5.1 Improvement
– Use LangGraph's Persistence Layer
• Store conversation history, agent state, and tool usage persistently.
• Support long-term user memory, allowing the agent to recall past
interactions.
– Enable Human-in-the-Loop (HITL)
• Introduce checkpointed execution, where human oversight can pause,
correct, and resume agent workflows when needed.
– Implement Multi-Session Memory
• The agent can remember past failed attempts and adjust behavior
accordingly.
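
The idea of checkpointed execution with human oversight can be illustrated in plain Python. The sketch below does not use LangGraph's persistence API; FileCheckpointer, run_with_oversight, the thread identifier, and the JSON file layout are all hypothetical. State is saved after every step, a human approval callback can pause the run, and a later session resumes from the saved state instead of starting over.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]


class FileCheckpointer:
    """Stores one state per thread id in a JSON file (illustrative only)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _read(self) -> Dict[str, State]:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def load(self, thread_id: str) -> Optional[State]:
        return self._read().get(thread_id)

    def save(self, thread_id: str, state: State) -> None:
        data = self._read()
        data[thread_id] = state
        self.path.write_text(json.dumps(data))


def run_with_oversight(actions: List[str], checkpointer: FileCheckpointer,
                       thread_id: str,
                       approve: Callable[[str], bool]) -> Tuple[str, State]:
    """Execute actions in order, pausing whenever the human declines one."""
    state = checkpointer.load(thread_id) or {"next": 0, "history": []}
    while state["next"] < len(actions):
        action = actions[state["next"]]
        if not approve(action):
            checkpointer.save(thread_id, state)   # pause; can be resumed later
            return "paused", state
        state["history"].append(action)           # stand-in for running a tool
        state["next"] += 1
        checkpointer.save(thread_id, state)
    return "finished", state


def demo() -> Tuple[Tuple[str, State], Tuple[str, State], List[str]]:
    actions = ["Type", "Click", "Bing"]
    asked: List[str] = []

    def approve_all(action: str) -> bool:
        asked.append(action)
        return True

    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "checkpoints.json"
        # Session 1: the human rejects the "Bing" step, so the run pauses.
        first = run_with_oversight(actions, FileCheckpointer(path), "user-1",
                                   lambda action: action != "Bing")
        # Session 2: a new checkpointer resumes from the saved state.
        second = run_with_oversight(actions, FileCheckpointer(path), "user-1",
                                    approve_all)
    return first, second, asked


print(demo())
```

In the demo, the second session only asks for approval of the step that was pending, which is the behavior a persistence layer is meant to provide across sessions.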

6 Conclusion
This research implements a Web agent based on LangGraph, driven by the
multimodal large language model GPT-4o. Relying on an automated Web browsing
environment, multiple Web page operation tools have been developed. Through
case studies, this research demonstrates successful applications of the agent
in Web interactions and explores the challenges the agent faces in areas such
as page navigation and hallucination handling. In the future, further
systematic optimization is needed to improve the stability and execution
efficiency of the agent in the Web environment.

References
1. Xingyao Wang, Yangyi Chen, et al. Executable code actions elicit better LLM
agents. CoRR, abs/2402.01030, 2024.
2. Jiwen Zhang, Yaqi Yu, et al. UI-Hawk: Unleashing the screen stream
understanding for GUI agents. crossref, 2024.
3. CodeGemma Team, Heri Zhao, et al. CodeGemma: Open code models based on
Gemma. CoRR, abs/2406.11409, 2024.
4. Shunyu Yao, Jeffrey Zhao, et al. ReAct: Synergizing reasoning and acting in
language models. CoRR, 2022.
5. Shuyan Zhou, Frank F. Xu, et al. WebArena: A realistic web environment for
building autonomous agents. CoRR, abs/2307.13854, 2023.
6. Andy Zhou, Kai Yan, et al. Language agent tree search unifies reasoning
acting and planning in language models. CoRR, abs/2310.04406, 2024.
7. Noah Shinn, Federico Cassano, et al. Reflexion: Language agents with verbal
reinforcement learning. CoRR, 2023.

8. Hongliang He, Wenlin Yao, et al. WebVoyager: Building an end-to-end web
agent with large multimodal models. Proceedings of the 62nd Annual Meeting of
the Association for Computational Linguistics, pages 6864–6890, 2024.
9. Hongjin Su, Ruoxi Sun, et al. Learn-by-interact: A data-centric framework
for self-adaptive agents in realistic environments. CoRR, 2025.
10. Boyuan Zheng, Boyu Gou, et al. GPT-4V(ision) is a generalist web agent, if
grounded. ICML, 2024.
