> ## Documentation Index
> Fetch the complete documentation index at: https://hyperbrowser.ai/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Crawl

> Crawl a website and get structured data from multiple pages

Crawl starts from a URL and follows links across the site, returning content from each page in the formats you choose—markdown, HTML, links, screenshots, structured JSON, or a branding profile. It shares the same output options as [Fetch](/web/fetch), applied to every page it visits.

<Info>
  For the full API schema, see the [Crawl API Reference](/api-reference/start-a-web-crawl-job).
</Info>

***

## Quick Start

<Steps>
  <Step title="Install the SDK">
    <CodeGroup>
      ```bash npm theme={null}
      npm install @hyperbrowser/sdk
      ```

      ```bash yarn theme={null}
      yarn add @hyperbrowser/sdk
      ```

      ```bash pip theme={null}
      pip install hyperbrowser
      ```

      ```bash uv theme={null}
      uv add hyperbrowser
      ```
    </CodeGroup>
  </Step>

  <Step title="Crawl a website">
    <CodeGroup>
      ```typescript Node theme={null}
      import { Hyperbrowser } from "@hyperbrowser/sdk";
      import { config } from "dotenv";

      config();

      const client = new Hyperbrowser({
        apiKey: process.env.HYPERBROWSER_API_KEY,
      });

      const result = await client.web.crawl.startAndWait({
        url: "https://example.com",
        crawlOptions: {
          maxPages: 10,
          followLinks: true,
        },
      });

      console.log(result);
      ```

      ```python Python theme={null}
      import os
      from dotenv import load_dotenv
      from hyperbrowser import Hyperbrowser
      from hyperbrowser.models import StartWebCrawlJobParams, WebCrawlOptions

      load_dotenv()

      client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))

      result = client.web.crawl.start_and_wait(
          StartWebCrawlJobParams(
              url="https://example.com",
              crawl_options=WebCrawlOptions(
                  max_pages=10,
                  follow_links=True,
              ),
          )
      )

      print(result)
      ```

      ```bash cURL theme={null}
      # Start a crawl job
      curl -X POST https://api.hyperbrowser.ai/api/web/crawl \
        -H 'Content-Type: application/json' \
        -H 'x-api-key: <YOUR_API_KEY>' \
        -d '{
          "url": "https://example.com",
          "crawlOptions": {
            "maxPages": 10,
            "followLinks": true
          }
        }'

      # Get crawl job results
      curl https://api.hyperbrowser.ai/api/web/crawl/{jobId} \
        -H 'x-api-key: <YOUR_API_KEY>'
      ```
    </CodeGroup>
  </Step>
</Steps>

<Info>
  Crawling is asynchronous—start the job and then poll for results. The SDKs provide a `startAndWait` / `start_and_wait` convenience method that handles polling for you.
</Info>

***

## Response

Starting a crawl job returns a `jobId`:

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
```

Once complete, the full response includes an array of page results under `data`:

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": [
    {
      "url": "https://example.com",
      "status": "completed",
      "metadata": {
        "title": "Example Domain",
        "sourceURL": "https://example.com"
      },
      "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples..."
    },
    {
      "url": "https://example.com/about",
      "status": "completed",
      "metadata": {
        "title": "About - Example Domain",
        "sourceURL": "https://example.com/about"
      },
      "markdown": "# About\n\nMore information about this example..."
    }
  ],
  "totalPages": 2,
  "totalPageBatches": 1,
  "currentPageBatch": 0,
  "batchSize": 10
}
```

The `status` field on the job can be `pending`, `running`, `completed`, or `failed`. Each page in `data` also has its own `status` and may include an `error` field if that page failed.

***

## Crawl Options

Control how the crawler traverses the site with `crawlOptions`:

| Field                          | Type       | Default | Description                                              |
| ------------------------------ | ---------- | ------- | -------------------------------------------------------- |
| `crawlOptions.maxPages`        | `number`   | `10`    | Maximum number of pages to crawl (max: 100)              |
| `crawlOptions.followLinks`     | `boolean`  | `true`  | Whether to follow links found on crawled pages           |
| `crawlOptions.ignoreSitemap`   | `boolean`  | `false` | Whether to ignore the site's sitemap                     |
| `crawlOptions.includePatterns` | `string[]` | `[]`    | URL patterns to include (only matching URLs are crawled) |
| `crawlOptions.excludePatterns` | `string[]` | `[]`    | URL patterns to exclude from crawling                    |

<Accordion title="Example with URL filters">
  <CodeGroup>
    ```typescript Node theme={null}
    import { Hyperbrowser } from "@hyperbrowser/sdk";
    import { config } from "dotenv";

    config();

    const client = new Hyperbrowser({
      apiKey: process.env.HYPERBROWSER_API_KEY,
    });

    const result = await client.web.crawl.startAndWait({
      url: "https://example.com",
      crawlOptions: {
        maxPages: 50,
        followLinks: true,
        includePatterns: ["/docs/*", "/blog/*"],
        excludePatterns: ["/docs/archive/*"],
      },
    });

    console.log(result);
    ```

    ```python Python theme={null}
    import os
    from dotenv import load_dotenv
    from hyperbrowser import Hyperbrowser
    from hyperbrowser.models import StartWebCrawlJobParams, WebCrawlOptions

    load_dotenv()

    client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))

    result = client.web.crawl.start_and_wait(
        StartWebCrawlJobParams(
            url="https://example.com",
            crawl_options=WebCrawlOptions(
                max_pages=50,
                follow_links=True,
                include_patterns=["/docs/*", "/blog/*"],
                exclude_patterns=["/docs/archive/*"],
            ),
        )
    )

    print(result)
    ```

    ```bash cURL theme={null}
    curl -X POST https://api.hyperbrowser.ai/api/web/crawl \
      -H 'Content-Type: application/json' \
      -H 'x-api-key: <YOUR_API_KEY>' \
      -d '{
        "url": "https://example.com",
        "crawlOptions": {
          "maxPages": 50,
          "followLinks": true,
          "includePatterns": ["/docs/*", "/blog/*"],
          "excludePatterns": ["/docs/archive/*"]
        }
      }'
    ```
  </CodeGroup>
</Accordion>

***

## Outputs

Use `outputs.formats` to control what data is returned for each crawled page. This works the same as [Fetch outputs](/web/fetch#outputs)—you can request markdown, HTML, links, screenshots, structured JSON, or a branding profile.

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const result = await client.web.crawl.startAndWait({
    url: "https://example.com",
    outputs: {
      formats: ["markdown", "links"],
    },
    crawlOptions: {
      maxPages: 10,
    },
  });

  for (const page of result.data) {
    console.log(page.url, page.markdown?.slice(0, 100));
  }
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartWebCrawlJobParams, WebCrawlOptions, FetchOutputOptions

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))

  result = client.web.crawl.start_and_wait(
      StartWebCrawlJobParams(
          url="https://example.com",
          outputs=FetchOutputOptions(
              formats=["markdown", "links"]
          ),
          crawl_options=WebCrawlOptions(max_pages=10),
      )
  )

  for page in result.data:
      print(page.url, page.markdown[:100] if page.markdown else "")
  ```

  ```bash cURL theme={null}
  curl -X POST https://api.hyperbrowser.ai/api/web/crawl \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
      "url": "https://example.com",
      "outputs": {
        "formats": ["markdown", "links"]
      },
      "crawlOptions": {
        "maxPages": 10
      }
    }'
  ```
</CodeGroup>

<Accordion title="Structured JSON extraction per page">
  Pass a JSON Schema to extract structured data from each crawled page. You can use a raw JSON Schema object, a Zod schema (Node), or a Pydantic model (Python).

  <CodeGroup>
    ```typescript Node theme={null}
    import { Hyperbrowser } from "@hyperbrowser/sdk";
    import { config } from "dotenv";
    import { z } from "zod";

    config();

    const client = new Hyperbrowser({
      apiKey: process.env.HYPERBROWSER_API_KEY,
    });

    const PageSchema = z.object({
      heading: z.string(),
      description: z.string(),
    });

    const result = await client.web.crawl.startAndWait({
      url: "https://example.com",
      outputs: {
        formats: [
          "markdown",
          {
            type: "json",
            schema: PageSchema,
          },
        ],
      },
      crawlOptions: {
        maxPages: 5,
      },
    });

    for (const page of result.data) {
      console.log(page.url, page.json);
    }
    ```

    ```python Python theme={null}
    import os
    from dotenv import load_dotenv
    from hyperbrowser import Hyperbrowser
    from hyperbrowser.models import (
        StartWebCrawlJobParams,
        WebCrawlOptions,
        FetchOutputOptions,
        FetchOutputJson,
    )
    from pydantic import BaseModel

    load_dotenv()

    client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


    class PageData(BaseModel):
        heading: str
        description: str


    result = client.web.crawl.start_and_wait(
        StartWebCrawlJobParams(
            url="https://example.com",
            outputs=FetchOutputOptions(
                formats=[
                    "markdown",
                    FetchOutputJson(type="json", schema=PageData),
                ]
            ),
            crawl_options=WebCrawlOptions(max_pages=5),
        )
    )

    for page in result.data:
        print(page.url, page.json_)
    ```
  </CodeGroup>
</Accordion>

For the full list of output formats and options (screenshots, sanitization, selectors, storage state), see the [Fetch outputs documentation](/web/fetch#outputs).

***

## Output Controls

Control what gets extracted from each crawled page. These work the same as [Fetch output controls](/web/fetch#output-controls):

| Field                      | Type       | Default  | Description                                                |
| -------------------------- | ---------- | -------- | ---------------------------------------------------------- |
| `outputs.sanitize`         | `string`   | `"none"` | Sanitize mode: `"none"`, `"basic"`, or `"advanced"`        |
| `outputs.includeSelectors` | `string[]` | `[]`     | CSS selectors to include (only matching elements returned) |
| `outputs.excludeSelectors` | `string[]` | `[]`     | CSS selectors to exclude from output                       |
| `outputs.storageState`     | `object`   | —        | Pre-seed localStorage/sessionStorage before fetching       |

***

## Browser & Stealth

Configure how the cloud browser runs. These options apply to all pages in the crawl:

| Field                   | Type      | Default                        | Description                                                                                      |
| ----------------------- | --------- | ------------------------------ | ------------------------------------------------------------------------------------------------ |
| `stealth`               | `string`  | `"auto"`                       | Stealth mode: `"none"`, `"auto"`, or `"ultra"`                                                   |
| `browser.profileId`     | `string`  | —                              | Reuse an existing browser profile                                                                |
| `browser.solveCaptchas` | `boolean` | `false`                        | Enable CAPTCHA solving                                                                           |
| `browser.screen`        | `object`  | `{ width: 1280, height: 720 }` | Set viewport dimensions (`width`, `height`)                                                      |
| `browser.location`      | `object`  | —                              | Localize via proxy location (`country`, `state`, `city`). If set, proxy is enabled automatically |

## Navigation Controls

Control page load behavior and timing for each crawled page:

| Field                  | Type     | Default              | Description                                                                         |
| ---------------------- | -------- | -------------------- | ----------------------------------------------------------------------------------- |
| `navigation.waitUntil` | `string` | `"domcontentloaded"` | Load condition: `"load"`, `"domcontentloaded"`, or `"networkidle"`                  |
| `navigation.waitFor`   | `number` | `0`                  | Milliseconds to wait after navigation completes before collecting outputs (0–30000) |
| `navigation.timeoutMs` | `number` | `30000`              | Max time (ms) to wait for navigation (1–60000)                                      |

## Cache Controls

Control caching behavior for crawl results:

| Field                 | Type     | Default | Description                                                                                         |
| --------------------- | -------- | ------- | --------------------------------------------------------------------------------------------------- |
| `cache.maxAgeSeconds` | `number` | —       | Cache control—cached results older than this are treated as stale. Set to `0` to bypass cache reads |

***

## Pagination

Crawl results are returned in batches. You can control pagination when retrieving results:

| Parameter   | Type     | Default | Description                      |
| ----------- | -------- | ------- | -------------------------------- |
| `page`      | `number` | `0`     | Page batch index to retrieve     |
| `batchSize` | `number` | `10`    | Number of page results per batch |

The response includes pagination metadata:

| Field              | Description                     |
| ------------------ | ------------------------------- |
| `totalPages`       | Total number of crawled pages   |
| `totalPageBatches` | Total number of result batches  |
| `currentPageBatch` | Current batch index             |
| `batchSize`        | Number of results in each batch |

<Info>
  When using `startAndWait` / `start_and_wait` with `returnAllPages` set to `true` (the default), the SDK automatically fetches all paginated results and combines them into a single response.
</Info>
