Crawl starts from a URL and follows links across the site, returning content from each page in the formats you choose—markdown, HTML, links, screenshots, or structured JSON. It shares the same output options as Fetch, applied to every page it visits.
For the full API schema, see the Crawl API Reference.

Quick Start

1. Install the SDK

npm install @hyperbrowser/sdk
2. Crawl a website

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const result = await client.web.crawl.startAndWait({
  url: "https://example.com",
  crawlOptions: {
    maxPages: 10,
    followLinks: true,
  },
});

console.log(result);
Crawling is asynchronous—start the job and then poll for results. The SDKs provide a startAndWait / start_and_wait convenience method that handles polling for you.
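If you prefer to manage polling yourself, the loop looks roughly like this. This is a sketch: the `start`/`get` method names on `client.web.crawl` are assumptions here (check your SDK version), and `startAndWait` already wraps this loop for you.

```typescript
type JobStatus = "pending" | "running" | "completed" | "failed";

// Poll a status callback until the job reaches a terminal state.
async function pollUntilDone(
  getStatus: () => Promise<JobStatus>,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<JobStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus();
    if (status === "completed" || status === "failed") return status;
    // Wait before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Crawl job timed out");
}

// Usage sketch (method names assumed):
// const { jobId } = await client.web.crawl.start({ url: "https://example.com" });
// const status = await pollUntilDone(async () =>
//   (await client.web.crawl.get(jobId)).status as JobStatus,
// );
```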

Response

Starting a crawl job returns a jobId:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
Once complete, the full response includes an array of page results under data:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": [
    {
      "url": "https://example.com",
      "status": "completed",
      "metadata": {
        "title": "Example Domain",
        "sourceURL": "https://example.com"
      },
      "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples..."
    },
    {
      "url": "https://example.com/about",
      "status": "completed",
      "metadata": {
        "title": "About - Example Domain",
        "sourceURL": "https://example.com/about"
      },
      "markdown": "# About\n\nMore information about this example..."
    }
  ],
  "totalPages": 2,
  "totalPageBatches": 1,
  "currentPageBatch": 0,
  "batchSize": 10
}
The status field on the job can be pending, running, completed, or failed. Each page in data also has its own status and may include an error field if that page failed.
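Since individual pages can fail even when the overall job completes, it can be useful to split results before processing. A minimal sketch, using the per-page `status` and `error` fields shown in the response above:

```typescript
// Shape of a single entry in the response's data array (fields from the example above).
interface CrawledPage {
  url: string;
  status: string;
  error?: string;
  markdown?: string;
}

// Separate successfully crawled pages from failures.
function splitByStatus(pages: CrawledPage[]) {
  const ok = pages.filter((p) => p.status === "completed");
  const failed = pages.filter((p) => p.status !== "completed");
  return { ok, failed };
}

// const { ok, failed } = splitByStatus(result.data);
// failed.forEach((p) => console.warn(`${p.url} failed: ${p.error}`));
```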

Crawl Options

Control how the crawler traverses the site with crawlOptions:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `crawlOptions.maxPages` | number | 10 | Maximum number of pages to crawl (max: 100) |
| `crawlOptions.followLinks` | boolean | true | Whether to follow links found on crawled pages |
| `crawlOptions.ignoreSitemap` | boolean | false | Whether to ignore the site's sitemap |
| `crawlOptions.includePatterns` | string[] | [] | URL patterns to include (only matching URLs are crawled) |
| `crawlOptions.excludePatterns` | string[] | [] | URL patterns to exclude from crawling |
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const result = await client.web.crawl.startAndWait({
  url: "https://example.com",
  crawlOptions: {
    maxPages: 50,
    followLinks: true,
    includePatterns: ["/docs/*", "/blog/*"],
    excludePatterns: ["/docs/archive/*"],
  },
});

console.log(result);

Outputs

Use outputs.formats to control what data is returned for each crawled page. This works the same as Fetch outputs—you can request markdown, HTML, links, screenshots, or structured JSON.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const result = await client.web.crawl.startAndWait({
  url: "https://example.com",
  outputs: {
    formats: ["markdown", "links"],
  },
  crawlOptions: {
    maxPages: 10,
  },
});

for (const page of result.data) {
  console.log(page.url, page.markdown?.slice(0, 100));
}
Pass a JSON Schema to extract structured data from each crawled page. You can use a raw JSON Schema object, a Zod schema (Node), or a Pydantic model (Python).
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
import { z } from "zod";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const PageSchema = z.object({
  heading: z.string(),
  description: z.string(),
});

const result = await client.web.crawl.startAndWait({
  url: "https://example.com",
  outputs: {
    formats: [
      "markdown",
      {
        type: "json",
        schema: PageSchema,
      },
    ],
  },
  crawlOptions: {
    maxPages: 5,
  },
});

for (const page of result.data) {
  console.log(page.url, page.json);
}
For the full list of output formats and options (screenshots, sanitization, selectors, storage state), see the Fetch outputs documentation.

Output Controls

Control what gets extracted from each crawled page. These work the same as Fetch output controls:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `outputs.sanitize` | string | "none" | Sanitize mode: "none", "basic", or "advanced" |
| `outputs.includeSelectors` | string[] | [] | CSS selectors to include (only matching elements returned) |
| `outputs.excludeSelectors` | string[] | [] | CSS selectors to exclude from output |
| `outputs.storageState` | object | | Pre-seed localStorage/sessionStorage before fetching |
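As an illustration, here is an `outputs` object that strips navigation chrome and keeps only the main content of each page. The field names come from the table above; the selector values are examples you would adapt to the target site.

```typescript
// Output controls sketch: basic sanitization plus selector filtering.
const outputs = {
  formats: ["markdown"],
  sanitize: "basic",                      // "none" | "basic" | "advanced"
  includeSelectors: ["main", "article"],  // only these elements are returned
  excludeSelectors: ["nav", "footer"],    // dropped from the output
};

// Pass to the crawl request:
// await client.web.crawl.startAndWait({ url: "https://example.com", outputs });
```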

Browser & Stealth

Configure how the cloud browser runs. These options apply to all pages in the crawl:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `stealth` | string | "auto" | Stealth mode: "none", "auto", or "ultra" |
| `browser.profileId` | string | | Reuse an existing browser profile |
| `browser.solveCaptchas` | boolean | false | Enable CAPTCHA solving |
| `browser.screen` | object | { width: 1280, height: 720 } | Set viewport dimensions (width, height) |
| `browser.location` | object | | Localize via proxy location (country, state, city). If set, proxy is enabled automatically |
Control page load behavior and timing for each crawled page:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `navigation.waitUntil` | string | "domcontentloaded" | Load condition: "load", "domcontentloaded", or "networkidle" |
| `navigation.waitFor` | number | 0 | Milliseconds to wait after navigation completes before collecting outputs (0–30000) |
| `navigation.timeoutMs` | number | 30000 | Max time (ms) to wait for navigation (1–60000) |
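A request combining the browser and navigation options above might look like this. The values are illustrative, not recommendations; field names follow the tables in this section.

```typescript
// Sketch of a crawl request with browser, stealth, and navigation options.
const crawlRequest = {
  url: "https://example.com",
  stealth: "auto",
  browser: {
    solveCaptchas: true,
    screen: { width: 1280, height: 720 },
    location: { country: "US" }, // setting a location enables the proxy automatically
  },
  navigation: {
    waitUntil: "networkidle", // wait for network to go quiet on each page
    waitFor: 1000,            // then wait 1s before collecting outputs
    timeoutMs: 45000,         // give slow pages up to 45s to navigate
  },
  crawlOptions: { maxPages: 10 },
};

// await client.web.crawl.startAndWait(crawlRequest);
```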

Cache Controls

Control caching behavior for crawl results:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `cache.maxAgeSeconds` | number | | Cache control: cached results older than this are treated as stale. Set to 0 to bypass cache reads |
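For example, two illustrative cache settings: one that accepts results up to an hour old, and one that forces a fresh crawl by bypassing cache reads.

```typescript
// Accept cached page results up to one hour old.
const reuseRecent = { cache: { maxAgeSeconds: 3600 } };

// Bypass cache reads entirely and crawl fresh.
const forceFresh = { cache: { maxAgeSeconds: 0 } };

// await client.web.crawl.startAndWait({ url: "https://example.com", ...reuseRecent });
```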

Pagination

Crawl results are returned in batches. You can control pagination when retrieving results:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `page` | number | 0 | Page batch index to retrieve |
| `batchSize` | number | 10 | Number of page results per batch |
The response includes pagination metadata:
| Field | Description |
| --- | --- |
| `totalPages` | Total number of crawled pages |
| `totalPageBatches` | Total number of result batches |
| `currentPageBatch` | Current batch index |
| `batchSize` | Number of results in each batch |
When using startAndWait / start_and_wait with returnAllPages set to true (the default), the SDK automatically fetches all paginated results and combines them into a single response.
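If you fetch batches manually instead, the batch count follows directly from `totalPages` and `batchSize`. A sketch (the per-batch getter name below is an assumption; with `returnAllPages` enabled the SDK handles this for you):

```typescript
// Number of result batches for a crawl, matching totalPageBatches in the response.
function totalPageBatches(totalPages: number, batchSize: number): number {
  return Math.ceil(totalPages / batchSize);
}

// Manual pagination sketch — getter name assumed, check your SDK:
// const batches = totalPageBatches(result.totalPages, result.batchSize);
// for (let page = 0; page < batches; page++) {
//   const batch = await client.web.crawl.get(jobId, { page, batchSize: 10 });
// }
```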