GPT-4’s Maze Navigation: A Deep Dive into ReAct Agent and LLM’s Thoughts

10 min readMay 10, 2023

“A anxious looking robot navigating an ancient greek maze with stone walls.”

The Sparks

My colleagues at Microsoft Research recently published a paper that demonstrated GPT-4’s navigational and mapping capability.

[2303.12712] Sparks of Artificial General Intelligence: Early experiments with GPT-4 (arxiv.org)

They performed this demonstration through a chat with GPT-4. At the beginning of the chat, they provided the description of the navigation task. They gave feedback about the current position and possible next moves after each move by GPT-4. GPT-4 has no access to the complete map throughout the chat. It eventually arrived at the goal room, and generated a map of the house.

This navigational capability is astonishing on the surface: it suggests that GPT-4 has two-dimensional spatial awareness. How does it recognize and memorize the dead ends to avoid repeated visits? How does it plan the next move?

That gave me an idea: what if we challenged GPT-4 to a maze game?

Let’s Build It — A Deep Dive in to ReAct Agent

To visualize the maze and GPT-4’s navigation in real time, I used a Python maze generator. This Python maze generator can create random n x m rectangular mazes with different levels of difficulty. It also has an API allowing me to visualize the moves.

I connected GPT-4 with the maze API using Langchain’s Agent framework. My complete code is only 87 lines of Python. Specifically, I used the default zero-shot-react-description agent and two tools:

look(), which returns a list of (x, y) coordinates for the next possible positions.
move(position), which moves the agent to a specified position.

Here are the tool definitions. These tools are member functions of a maze object called game, which stores the internal states and renders graphics.

tools = [
    Tool(
        name="look",
        func=game.look,
        description=(
          "A tool for checking the available next positions to move to. "
          "It takes an empty string as input and returns a list of possible "
          "next positions."
        ),
    ),
    Tool(
        name="move",
        func=game.move,
        description=(
          "A tool for moving in the maze. It takes an input of a target "
          "position (x, y), and returns the result of the move. If the "
          "move was successful, it returns the new position, otherwise it "
          "returns an error. Always use the look tool to check possible "
          "directions before making a move."
        ),
    ),
]

I gave GPT-4 the following instruction:

"""You are an agent in a rectangular maze with {rows} rows and {columns} 
columns. Each cell has a position (x, y) where x is the row index and y is 
the column index. You can move one cell at a time. There may be walls 
between some cells that you cannot move across. You must move toward the 
goal using as few steps as possible. You must think about which direction 
to move using A* search. Your start position is {start}. DO NOT STOP until 
you have successfully reached the goal position at {goal}."""

After everything was set up, GPT-4 started its navigation in a Thought-Action-Observation loop, as shown in the diagram below.

The Thought-Action-Observation loop (i.e., ReAct) used by GPT-4 for maze navigation.

The Thought-Action-Observation loop is a simple but powerful innovation that enables LLM to interact with external APIs, such as the maze object in this case. The loop is two-way conversation between an LLM (such as GPT-4) and a Python handler. It begins with a GPT-4’s response to the instructions. In the response, GPT-4 lays out its first thought and action:

Thought: I need to determine the possible next positions I can move to from my 
current position.
Action: look
Action Input: ""

This message from GPT-4 follows the structure defined in the system prompt for the ReAct agent in Langchain. The Python handler parses this message, calls the look() tool, and returns the next possible positions.

Observation: Next possible positions: (4, 3), (3, 4)

This text is then sent back to GPT-4 as a user message, and GPT-4 replies with a new thought and action:

Thought:I have two possible positions to move to. I will choose the 
position that has the lowest total cost, which is the sum of the 
movement cost (g) and the estimated cost to the goal (h) using A* search.
Action: move
Action Input: (4, 3)

The loop repeats with a different action: move((4,3)), updates the current position in the maze, and returns an observation:

Observation: Moved to position (4, 3).

The loop continues this way:

GPT-4 thinks out loud and replies with a structured action.
The Python handler executes the action and replies with an observation

This goes on until it reaches the goal. Then the Python handler replies:

Observation: Success! You have arrived at the goal at position (1, 1). 
Your path taken is (4, 3), (4, 2), (4, 1), (4, 2), (4, 3), (4, 4), (3, 4), (3, 3), (3, 2), (3, 1), (2, 1), (2, 2), (2, 3), (2, 4), (1, 4), (1, 3), (1, 2), (1, 1).

GPT-4 then says:

Thought:I now know the final answer.
Final Answer: I have successfully reached the goal at position (1, 1) by 
following the path (4, 3), (4, 2), (4, 1), (4, 2), (4, 3), (4, 4), (3, 4), 
(3, 3), (3, 2), (3, 1), (2, 1), (2, 2), (2, 3), (2, 4), (1, 4), (1, 3), 
(1, 2), (1, 1).

The Python handler terminates the conversation after this Final Answer.

GPT-4 on a 5x5 Maze (150x playback speed)

GPT-4 on a 6x6 Maze (150x playback speed)

GPT-4 navigates a 7x7 maze. It did not succeed and gave up.

GPT-4 performed pretty well up to size 6x6. It avoided getting trapped in loops, but it did not always take the optimal path. This is not surprising because it only had local information of its immediate surrounding and had to decided based on that. Its memory usage is limited to past moves and observations.

It is also worth noting that GPT-4 follows instruction very faithfully, so clear instructions are enough. This is why I think GPT-4 is superior to other LLMs. Those other ones requires extensive prompt-tuning.

The result is quite interesting. There is a remaining question: how does GPT-4 navigate the maze exactly? It says A* search in its thought, but it is not clear whether it used any special algorithm.

Examine the Thoughts of GPT-4

My tweet about GPT-4 maze navigator

Twitter discussions rarely go into any depth. My tweet was no exception. A month has passed and the field of LLM has already gone through several revolutions, but no one has ever followed up on the actual navigation mechanism used by GPT-4. I decided to look into it.

My result shows that GPT-4 clearly learned a pretty good get_next_move(past_moves, past_observations) function through pre-training on next-token prediction. That’s pretty impressive. Some skeptics may suggest that this get_next_move function is just pattern matching against mazes in the training data. I doubt that, as the number of possible mazes is grows exponentially with respect to the maze size. This table lists the number of possible mazes given the number of rows and columns.

================================================================
m\n| 1   2     3       4         5           6             7
---+------------------------------------------------------------
1  | 1   1     1       1         1           1             1 ...
2  | 1   3     8      20        49         119           288 ...
3  | 1   8    45     234      1193        6049         30616 ...
4  | 1  20   234    2423     24455      245972       2473317 ...
5  | 1  49  1193   24455    482443     9469361     185899132 ...
6  | 1 119  6049  245972   9469361   360923899   13742823032 ...
7  | 1 288 30616 2473317 185899132 13742823032 1012326365581 ...
  ...

https://oeis.org/A359573

Based on this table, there are 360923899 different 6x6 mazes. There are probably not that many mazes in the training data.

Some recent academic work looks into the explainability aspect of large language models and finds that their own explanation are sometimes quite good. So, we should take a closer look into the thoughts of GPT-4 when it came up with moves. Let’s examine the thoughts in GPT-4’s messages, and see what techniques does it employes.

Memorizing

This is the first thought I found quite telling:

Observation: Next possible positions: (6, 5), (6, 3), (5, 4)
Thought: Since I have already visited the positions (6, 5) and (5, 4), 
I will now move to (6, 3) as it is the only unvisited position.
Action: move
Action Input: (6, 3)

It clearly uses information from chat history about visited positions to exclude them from possible next moves. We can certainly say that the get_next_move function has access to memory of historical moves.

Backtracking

The following thought shows how GPT-4 does backtracking from (4, 1) to (5, 2) when navigating the 5x5 maze shown earlier:

Observation: Next possible positions: (5, 1)
Thought:I have only one possible next position (5, 1). This will move me 
back to a previous position, which is not helpful. Therefore, I'll 
backtrack to the previous position (5, 1) and try another path.
Action: move
Action Input: (5, 1)

Observation: Moved to position (5, 1).
Thought:I have returned to position (5, 1). Now, I will take the 
untried path from position (5, 2) to (4, 2).
Action: move
Action Input: (5, 2)

Observation: Moved to position (5, 2).
Thought:I have moved to position (5, 2). Now, I will take the untried 
path from position (5, 2) to (4, 2).
Action: move
Action Input: (4, 2)

GPT-4 also showed evidence of backtracking when navigating the 7x7 maze.

Dead-end Labeling

Backtracking is not enough to avoid getting into the same dead-end alley twice. In this thought below, GPT-4 labels a sub-portion of the maze as a dead-end. Humans also use this technique through associative memory when exploring real-life or virtual environments. I think this is probably the most effective technique, and GPT-4 uses it reliably.

Observation: Moved to position (7, 7).
Thought:Now, I am back at the starting position (7, 7). I need to check 
the available next positions again and choose a different path.
Action: look
Action Input: ""
Observation: Next possible positions: (7, 6), (6, 7)
Thought:I have two possible next positions: (7, 6) and (6, 7). Since I 
previously explored the path starting from (7, 6) and reached a dead-end, 
I will now choose the position (6, 7) to explore a different path.

Action: move
Action Input: (6, 7)

A* Search?

In the prompt I suggested using A* search algorithm, which is to use a simple heuristic function for deciding between multiple locally equivalent moves. A typical heuristic is to compare possible moves based on the Manhattan distance to the goal.

Unlike previous memory-based techniques, GPT-4 does not use A* search reliably: it sometimes uses the algorithm, but often times it simply chooses at random. The following example shows how GPT-4 uses A* search.

Observation: Moved to position (5, 4).
Thought:Now, I am at position (5, 4). I need to check the available next positions again.
Action: look
Action Input: ""
Observation: Next possible positions: (5, 5), (5, 3), (6, 4)
Thought:I have three possible next positions: (5, 5), (5, 3), and (6, 4). I should calculate the cost of each position, which includes the cost from the current position and the heuristic cost to the goal position (1, 1). Then, I will choose the position with the lowest total cost. I will not consider (6, 4) since it is the previous position.

Cost Calculation (Position, Movement Cost + Heuristic Cost) :
- (5, 5): 1 + abs(5-1) + abs(5-1) = 1 + 4 + 4 = 9
- (5, 3): 1 + abs(5-1) + abs(3-1) = 1 + 4 + 2 = 7

I will move to the position (5, 3) since it has the lowest total cost.
Action: move
Action Input: (5, 3)

What have we learned?

Based on the evidence collected so far, I can confidently say that GPT-4 has a good grasp of memorization-based navigation techniques such as backtracking and dead-end labeling. However it does not fully possess advanced planning skill such as using the A* search algorithm. What is an explanation for this?

One hypothesis is that since GPT-4 is a transformer, it produces one token at a time and appends each new token to the input for producing the next token. In that sense, it always thinks out loud in a linear chain of thoughts. Moreover, since a transformer produces each token in a constant number of computational steps, it has a limited ability to explore any non-linear search space, such as a maze. GPT-4 performs poorly compared to classical algorithms like Min-Max and Monte-Carlo search. Those algorithms do not think out loud, they plan internally and output with confidence about the expected outcome.

One may object to this hypothesis and argue that we can do better prompt engineering to “coerce” GPT-4 to always follow an advanced planning algorithm, and hook it up with a Python code interpreter tool for executing the algorithm. I think if we do that, we will not be evaluating the raw intelligence of GPT-4, but merely using it as a code interpreter that handles structured input and output — why not just write the code?

In summary, I think we still know very little about the capabilities and limitations of large language models. While training GPT-4 took only a couple of months, it may take years for human to fully understand what we have created.

The code for this blog post can be found on Github.