Crawl starts from a URL and follows links across the site, returning content from each page in the formats you choose—markdown, HTML, links, screenshots, or structured JSON. It shares the same output options as Fetch, applied to every page it visits.
For the full API schema, see the Crawl API Reference.

Quick Start

1. Install the SDK

npm install @hyperbrowser/sdk
2. Crawl a website

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const result = await client.web.crawl.startAndWait({
  url: "https://example.com",
  crawlOptions: {
    maxPages: 10,
    followLinks: true,
  },
});

console.log(result);
Crawling is asynchronous—start the job and then poll for results. The SDKs provide a startAndWait / start_and_wait convenience method that handles polling for you.
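If you prefer to manage polling yourself, the loop looks roughly like this. This is a sketch: the `start`/`get` method names on `client.web.crawl` are assumptions here (check your SDK version), and `startAndWait` already wraps this loop for you.

```typescript
type JobStatus = "pending" | "running" | "completed" | "failed";

// Poll a status callback until the job reaches a terminal state.
async function pollUntilDone(
  getStatus: () => Promise<JobStatus>,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<JobStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus();
    if (status === "completed" || status === "failed") return status;
    // Wait before checking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Crawl job timed out");
}

// Usage sketch (method names assumed):
// const { jobId } = await client.web.crawl.start({ url: "https://example.com" });
// const status = await pollUntilDone(async () =>
//   (await client.web.crawl.get(jobId)).status as JobStatus,
// );
```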

Response

Starting a crawl job returns a jobId:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
Once complete, the full response includes an array of page results under data:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": [
    {
      "url": "https://example.com",
      "status": "completed",
      "metadata": {
        "title": "Example Domain",
        "sourceURL": "https://example.com"
      },
      "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples..."
    },
    {
      "url": "https://example.com/about",
      "status": "completed",
      "metadata": {
        "title": "About - Example Domain",
        "sourceURL": "https://example.com/about"
      },
      "markdown": "# About\n\nMore information about this example..."
    }
  ],
  "totalPages": 2,
  "totalPageBatches": 1,
  "currentPageBatch": 0,
  "batchSize": 10
}
The status field on the job can be pending, running, completed, or failed. Each page in data also has its own status and may include an error field if that page failed.
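Since individual pages can fail even when the overall job completes, it can be useful to split results before processing. A minimal sketch, using the per-page `status` and `error` fields shown in the response above:

```typescript
// Shape of a single entry in the response's data array (fields from the example above).
interface CrawledPage {
  url: string;
  status: string;
  error?: string;
  markdown?: string;
}

// Separate successfully crawled pages from failures.
function splitByStatus(pages: CrawledPage[]) {
  const ok = pages.filter((p) => p.status === "completed");
  const failed = pages.filter((p) => p.status !== "completed");
  return { ok, failed };
}

// const { ok, failed } = splitByStatus(result.data);
// failed.forEach((p) => console.warn(`${p.url} failed: ${p.error}`));
```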

Crawl Options

Control how the crawler traverses the site with crawlOptions:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `crawlOptions.maxPages` | number | 10 | Maximum number of pages to crawl (max: 100) |
| `crawlOptions.followLinks` | boolean | true | Whether to follow links found on crawled pages |
| `crawlOptions.ignoreSitemap` | boolean | false | Whether to ignore the site's sitemap |
| `crawlOptions.includePatterns` | string[] | [] | URL patterns to include (only matching URLs are crawled) |
| `crawlOptions.excludePatterns` | string[] | [] | URL patterns to exclude from crawling |
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const result = await client.web.crawl.startAndWait({
  url: "https://example.com",
  crawlOptions: {
    maxPages: 50,
    followLinks: true,
    includePatterns: ["/docs/*", "/blog/*"],
    excludePatterns: ["/docs/archive/*"],
  },
});

console.log(result);

Outputs

Use outputs.formats to control what data is returned for each crawled page. This works the same as Fetch outputs—you can request markdown, HTML, links, screenshots, or structured JSON.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const result = await client.web.crawl.startAndWait({
  url: "https://example.com",
  outputs: {
    formats: ["markdown", "links"],
  },
  crawlOptions: {
    maxPages: 10,
  },
});

for (const page of result.data) {
  console.log(page.url, page.markdown?.slice(0, 100));
}
Pass a JSON Schema to extract structured data from each crawled page. You can use a raw JSON Schema object, a Zod schema (Node), or a Pydantic model (Python).
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
import { z } from "zod";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const PageSchema = z.object({
  heading: z.string(),
  description: z.string(),
});

const result = await client.web.crawl.startAndWait({
  url: "https://example.com",
  outputs: {
    formats: [
      "markdown",
      {
        type: "json",
        schema: PageSchema,
      },
    ],
  },
  crawlOptions: {
    maxPages: 5,
  },
});

for (const page of result.data) {
  console.log(page.url, page.json);
}
For the full list of output formats and options (screenshots, sanitization, selectors, storage state), see the Fetch outputs documentation.

Output Controls

Control what gets extracted from each crawled page. These work the same as Fetch output controls:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `outputs.sanitize` | string | "none" | Sanitize mode: "none", "basic", or "advanced" |
| `outputs.includeSelectors` | string[] | [] | CSS selectors to include (only matching elements returned) |
| `outputs.excludeSelectors` | string[] | [] | CSS selectors to exclude from output |
| `outputs.storageState` | object | | Pre-seed localStorage/sessionStorage before fetching |
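As an illustration, here is an `outputs` object that strips navigation chrome and keeps only the main content of each page. The field names come from the table above; the selector values are examples you would adapt to the target site.

```typescript
// Output controls sketch: basic sanitization plus selector filtering.
const outputs = {
  formats: ["markdown"],
  sanitize: "basic",                      // "none" | "basic" | "advanced"
  includeSelectors: ["main", "article"],  // only these elements are returned
  excludeSelectors: ["nav", "footer"],    // dropped from the output
};

// Pass to the crawl request:
// await client.web.crawl.startAndWait({ url: "https://example.com", outputs });
```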

Browser & Stealth

Configure how the cloud browser runs. These options apply to all pages in the crawl:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `stealth` | string | "auto" | Stealth mode: "none", "auto", or "ultra" |
| `browser.profileId` | string | | Reuse an existing browser profile |
| `browser.solveCaptchas` | boolean | false | Enable CAPTCHA solving |
| `browser.screen` | object | { width: 1280, height: 720 } | Set viewport dimensions (width, height) |
| `browser.location` | object | | Localize via proxy location (country, state, city). If set, proxy is enabled automatically |
Control page load behavior and timing for each crawled page:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `navigation.waitUntil` | string | "domcontentloaded" | Load condition: "load", "domcontentloaded", or "networkidle" |
| `navigation.waitFor` | number | 0 | Milliseconds to wait after navigation completes before collecting outputs (0–30000) |
| `navigation.timeoutMs` | number | 30000 | Max time (ms) to wait for navigation (1–60000) |
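A request combining the browser and navigation options above might look like this. The values are illustrative, not recommendations; field names follow the tables in this section.

```typescript
// Sketch of a crawl request with browser, stealth, and navigation options.
const crawlRequest = {
  url: "https://example.com",
  stealth: "auto",
  browser: {
    solveCaptchas: true,
    screen: { width: 1280, height: 720 },
    location: { country: "US" }, // setting a location enables the proxy automatically
  },
  navigation: {
    waitUntil: "networkidle", // wait for network to go quiet on each page
    waitFor: 1000,            // then wait 1s before collecting outputs
    timeoutMs: 45000,         // give slow pages up to 45s to navigate
  },
  crawlOptions: { maxPages: 10 },
};

// await client.web.crawl.startAndWait(crawlRequest);
```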

Cache Controls

Control caching behavior for crawl results:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `cache.maxAgeSeconds` | number | | Cache control: cached results older than this are treated as stale. Set to 0 to bypass cache reads |
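For example, two illustrative cache settings: one that accepts results up to an hour old, and one that forces a fresh crawl by bypassing cache reads.

```typescript
// Accept cached page results up to one hour old.
const reuseRecent = { cache: { maxAgeSeconds: 3600 } };

// Bypass cache reads entirely and crawl fresh.
const forceFresh = { cache: { maxAgeSeconds: 0 } };

// await client.web.crawl.startAndWait({ url: "https://example.com", ...reuseRecent });
```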

Pagination

Crawl results are returned in batches. You can control pagination when retrieving results:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `page` | number | 0 | Page batch index to retrieve |
| `batchSize` | number | 10 | Number of page results per batch |
The response includes pagination metadata:
| Field | Description |
| --- | --- |
| `totalPages` | Total number of crawled pages |
| `totalPageBatches` | Total number of result batches |
| `currentPageBatch` | Current batch index |
| `batchSize` | Number of results in each batch |
When using startAndWait / start_and_wait with returnAllPages set to true (the default), the SDK automatically fetches all paginated results and combines them into a single response.
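If you fetch batches manually instead, the batch count follows directly from `totalPages` and `batchSize`. A sketch (the per-batch getter name below is an assumption; with `returnAllPages` enabled the SDK handles this for you):

```typescript
// Number of result batches for a crawl, matching totalPageBatches in the response.
function totalPageBatches(totalPages: number, batchSize: number): number {
  return Math.ceil(totalPages / batchSize);
}

// Manual pagination sketch — getter name assumed, check your SDK:
// const batches = totalPageBatches(result.totalPages, result.batchSize);
// for (let page = 0; page < batches; page++) {
//   const batch = await client.web.crawl.get(jobId, { page, batchSize: 10 });
// }
```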