Supercharging Web Agents with Vision Capabilities in Hyperbrowser

This cookbook demonstrates how adding vision capabilities to autonomous web agents dramatically improves their ability to navigate visually complex websites and extract information from visual elements that text-only agents struggle with.

We'll compare the same agent with and without vision capabilities on a real-world shopping task to showcase the difference in performance.

Prerequisites

You'll need a Hyperbrowser API key (sign up at hyperbrowser.ai if you don't have one).

Store your API key in a .env file in the notebook directory:

HYPERBROWSER_API_KEY=your_hyperbrowser_key_here

Step 1: Import Libraries and Set Up the Environment

We import the necessary libraries and initialize our Hyperbrowser client, which will handle our web browsing tasks.

import os
from dotenv import load_dotenv
from hyperbrowser import AsyncHyperbrowser
from hyperbrowser.models import StartBrowserUseTaskParams
from IPython.display import Markdown, display
load_dotenv()

Step 2: Initialize the Hyperbrowser Client

We create an instance of the AsyncHyperbrowser client using our API key.

hb = AsyncHyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))
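Before creating the client, it can help to fail fast if the key is missing; here is a small optional sketch (the helper name is ours, not part of the Hyperbrowser SDK):

```python
import os

def require_api_key(var: str = "HYPERBROWSER_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear
    message instead of a confusing auth error later."""
    key = os.getenv(var)
    if not key:
        raise RuntimeError(f"{var} is not set; add it to your .env file")
    return key
```

You could then pass `api_key=require_api_key()` to `AsyncHyperbrowser`.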

Step 3: Define the Browser Agent Function

We create a function that initializes an autonomous browser agent with an option to enable or disable vision capabilities. The task involves finding a specific DVD on eBay – a complex visual e-commerce environment that requires understanding product listings, images, and filtering options.

async def get_shopping_agent(use_vision=False):
    resp = await hb.agents.browser_use.start_and_wait(
        StartBrowserUseTaskParams(
            task="Find me the cheapest copy of Star Trek: The Next Generation on DVD on Ebay. Make sure it's a new copy. Return me the url, price, and shipping time.",
            use_vision=use_vision,
        )
    )
    if resp.data is not None:
        return resp.data.final_result
    return None
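Browser-use tasks can occasionally come back empty (the function above returns None in that case). One pattern worth considering is a simple retry wrapper, sketched here with a generic async callable so it isn't tied to the Hyperbrowser SDK:

```python
import asyncio

async def run_with_retry(agent_fn, max_attempts=3, delay_s=2.0):
    """Call an async agent function until it returns a non-None result.

    agent_fn: any zero-argument async callable (e.g. get_shopping_agent);
    max_attempts and delay_s are illustrative defaults -- tune per task.
    """
    for attempt in range(1, max_attempts + 1):
        result = await agent_fn()
        if result is not None:
            return result
        if attempt < max_attempts:
            await asyncio.sleep(delay_s)  # brief pause before retrying
    return None
```

In a notebook you could then write `response = await run_with_retry(get_shopping_agent)`.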

Step 4: Execute the Agent WITHOUT Vision Capabilities

First, we run the agent with vision disabled. E-commerce sites like eBay have complex layouts with product information spread across images, dynamically loaded content, and various UI elements. Let's see how a text-only agent performs.

response = await get_shopping_agent()
if response is not None:
    display(Markdown(response))
else:
    print("No response from the agent")

I was unable to find a new copy of Star Trek: The Next Generation on DVD, full box set, and return the URL, price, and shipping time. The extraction tool consistently failed to extract the correct information, even after scrolling and applying filters. The results extracted were for individual seasons, not the complete series. I have reached the maximum number of steps.

Results Without Vision

As the output above shows, the agent without vision capabilities was unable to complete the task: its extraction attempts returned individual seasons rather than the complete-series listing we asked for, and it ran out of steps. eBay's interface relies heavily on visual elements - from product images to buttons and filters - and without being able to "see" these elements, the agent cannot effectively search for products or extract meaningful information from the listings.

The failure demonstrates how challenging modern web interfaces, designed primarily for visual interaction, can be for text-only agents. While an agent can infer a great deal from the structure of the HTML document, many elements a human would identify at a glance - the search bar, filter options, and product cards - are difficult or impossible for a non-vision agent to locate and interact with reliably.

Step 5: Execute the Agent WITH Vision Capabilities

Now let's run the same task with vision capabilities enabled. The agent can now "see" the page like a human would, understanding images, visual layouts, and graphical elements that contain crucial information.

response = await get_shopping_agent(use_vision=True)
if response is not None:
    display(Markdown(response))
else:
    print("No response from the agent")

Results With Vision

The agent with vision capabilities was able to successfully navigate eBay and extract information about DVD listings. By being able to "see" and understand the visual elements of the page, including product images, buttons, and the overall layout, the agent could effectively:

  1. Locate and use the search functionality
  2. Find relevant DVD product listings
  3. Extract key details like prices, shipping costs, and product descriptions
  4. Navigate through the visual interface as a human would

This demonstrates how vision capabilities allow the agent to interact effectively with eBay's visually rich interface and complete the assigned task.
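Because the agent's final answer comes back as free-form markdown, you may want to pull the structured fields out of it. Here is a best-effort sketch; the regexes are assumptions about the output format, which varies from run to run:

```python
import re

def parse_listing(result_md: str) -> dict:
    """Extract the first URL and first dollar price from the agent's
    markdown answer; fields the patterns can't find come back as None."""
    url = re.search(r"https?://\S+", result_md)
    price = re.search(r"\$\d+(?:\.\d{2})?", result_md)
    return {
        "url": url.group(0) if url else None,
        "price": price.group(0) if price else None,
    }
```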

The Power of Vision in Web Agents

The contrast between the two results speaks volumes about the importance of vision capabilities for web agents:

  • Without Vision: The agent fails completely, unable to navigate eBay's interface effectively or extract the required information about products.

  • With Vision: The agent successfully identifies a product, extracts its URL, price, and shipping information with precision.

This demonstrates why vision capabilities are transformative for web agents operating in visually rich environments like e-commerce platforms, social media sites, or any website with complex visual layouts and image-based content.

By simply adding the use_vision=True parameter, your agent gains human-like visual comprehension, dramatically improving its ability to complete tasks on the visual web. There is, of course, a trade-off: vision agents process a significantly larger number of tokens per step, so the gain in fidelity comes with a higher cost.
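To make that trade-off concrete, here is a back-of-the-envelope sketch. All numbers are illustrative assumptions, not Hyperbrowser pricing or measured token counts:

```python
# Illustrative assumptions only -- measure real usage for your own tasks.
TEXT_TOKENS_PER_STEP = 2_000    # assumed DOM/text context per agent step
IMAGE_TOKENS_PER_STEP = 1_500   # assumed cost of one screenshot per step

def estimated_tokens(steps: int, use_vision: bool) -> int:
    """Rough token estimate for an agent run of `steps` steps."""
    per_step = TEXT_TOKENS_PER_STEP
    if use_vision:
        per_step += IMAGE_TOKENS_PER_STEP
    return steps * per_step
```

Under these assumptions, a 20-step run grows from 40,000 tokens without vision to 70,000 with it - the cost you weigh against the higher success rate.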

Conclusion

Vision-enabled web agents represent a significant leap forward in autonomous web automation. They can:

  1. Understand and interpret visually complex websites
  2. Extract information from images and visual layouts
  3. Successfully complete tasks that text-only agents fail at
  4. Navigate interfaces designed for human visual perception

As demonstrated in this simple example, adding vision to your browser agents opens up entirely new capabilities and dramatically increases their success rate on complex web tasks.