# Crawl

> Crawl websites and get formatted data from multiple pages

The Crawl API allows you to crawl websites and get data from multiple pages in a single request. Starting from a URL, it can navigate through the site and extract content from linked pages.

<Info>
  For detailed usage, check out the [Crawl API Reference](/api-reference/start-a-crawl-job).
</Info>

Hyperbrowser exposes endpoints for starting a crawl job and for retrieving its status and results. Crawling is asynchronous by default: you start the job, then check its status until it completes. The SDKs also provide a single function that handles the whole flow and returns the data once the job is completed.

## Installation

<CodeGroup>
  ```bash npm theme={null}
  npm install @hyperbrowser/sdk dotenv
  ```

  ```bash yarn theme={null}
  yarn add @hyperbrowser/sdk dotenv
  ```

  ```bash pip theme={null}
  pip install hyperbrowser python-dotenv
  ```

  ```bash uv theme={null}
  uv add hyperbrowser python-dotenv
  ```
</CodeGroup>

## Usage

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    // Handles both starting and waiting for crawl job response
    const crawlResult = await client.crawl.startAndWait({
      url: "https://example.com",
      maxPages: 10,
      followLinks: true,
    });
    console.log("Crawl result:", crawlResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartCrawlJobParams

  # Load environment variables from .env file
  load_dotenv()

  # Initialize Hyperbrowser client
  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  def main():
      # Start crawling and wait for completion
      crawl_result = client.crawl.start_and_wait(
          StartCrawlJobParams(
              url="https://example.com",
              max_pages=10,
              follow_links=True
          )
      )
      print("Crawl result:\n", crawl_result.model_dump_json(indent=2))


  main()
  ```

  ```bash cURL theme={null}
  # Start Crawl Job
  curl -X POST https://api.hyperbrowser.ai/api/crawl \
      -H 'Content-Type: application/json' \
      -H 'x-api-key: <YOUR_API_KEY>' \
      -d '{
          "url": "https://example.com",
          "maxPages": 10,
          "followLinks": true
      }'

  # Get Crawl Job Status
  curl https://api.hyperbrowser.ai/api/crawl/{jobId}/status \
      -H 'x-api-key: <YOUR_API_KEY>'

  # Get Crawl Job Status and Data
  curl https://api.hyperbrowser.ai/api/crawl/{jobId} \
      -H 'x-api-key: <YOUR_API_KEY>'
  ```
</CodeGroup>

## Response

The Start Crawl Job `POST /crawl` endpoint returns a `jobId`, which you use in subsequent requests to get information about the job.

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
```

The Get Crawl Job Status `GET /crawl/{jobId}/status` endpoint returns the following data:

```json theme={null}
{
  "status": "completed"
}
```

The Get Crawl Job `GET /crawl/{jobId}` endpoint returns the following data:

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "totalCrawledPages": 10,
  "data": [
    {
      "metadata": {
        "title": "Example Page",
        "description": "A sample webpage",
        "url": "https://example.com"
      },
      "markdown": "# Example Page\nThis is content..."
    }
  ]
}
```

The status of a crawl job is one of `pending`, `running`, `completed`, or `failed`. The results are returned as an array of scraped pages in the `data` field.
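
If you call the REST endpoints directly instead of using the SDK helper, you need a polling loop of your own. The following is a minimal sketch of one, not the SDK's implementation: `fetch_status` wraps the status endpoint from the cURL example, while `wait_for_crawl`, its parameter names, and the default interval and timeout are hypothetical choices made for illustration.

```python
import json
import time
import urllib.request
from typing import Callable

API_BASE = "https://api.hyperbrowser.ai/api"
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_status(job_id: str, api_key: str) -> str:
    """Call GET /crawl/{jobId}/status and return the job's status string."""
    request = urllib.request.Request(
        f"{API_BASE}/crawl/{job_id}/status",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["status"]


def wait_for_crawl(
    get_status: Callable[[], str],
    poll_interval: float = 2.0,   # assumed value, tune for your workload
    timeout: float = 300.0,       # assumed value, tune for your workload
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the job is completed or failed, or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl job still {status!r} after {timeout} seconds")
        sleep(poll_interval)
```

A call such as `wait_for_crawl(lambda: fetch_status(job_id, api_key))` would then block until the job finishes. Passing the status lookup in as a callable keeps the loop easy to test without network access.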

<Warning>
  Each crawled page has its own status of `completed` or `failed` and can have its own `error` field, so check each page individually instead of relying only on the job-level status.
</Warning>
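
As a rough illustration of handling per-page failures, the sketch below separates pages that scraped successfully from those that did not. It works on the raw JSON response as a plain dictionary. The per-page key names `status` and `error` are assumptions based on the warning above (the sample response does not show them), and `split_pages` is a hypothetical helper name.

```python
from typing import Any


def split_pages(job: dict[str, Any]) -> tuple[list[dict], list[dict]]:
    """Separate pages that scraped successfully from pages that failed."""
    succeeded: list[dict] = []
    failed: list[dict] = []
    for page in job.get("data") or []:
        # Assumption: each page carries "status" and, on failure, "error".
        if page.get("status") == "failed" or page.get("error"):
            failed.append(page)
        else:
            succeeded.append(page)
    return succeeded, failed


# Illustrative data only; the error message here is made up.
example_job = {
    "status": "completed",
    "data": [
        {"status": "completed", "metadata": {"url": "https://example.com"}, "markdown": "# Example Page"},
        {"status": "failed", "error": "illustrative error message", "metadata": {"url": "https://example.com/broken"}},
    ],
}
ok_pages, failed_pages = split_pages(example_job)
```

Note that the job itself can report `completed` while individual pages have failed, which is why the check runs per page.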

To see the full schema, check out the [API Reference](/api-reference/start-a-crawl-job).

## Crawl Options

You can configure various options for the crawl job:

* **maxPages**: Maximum number of pages to crawl (default: 10, max: 100)
* **followLinks**: Whether to follow links on the crawled pages (default: true)
* **ignoreSitemap**: Whether to ignore the sitemap (default: false)

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const crawlResult = await client.crawl.startAndWait({
      url: "https://example.com",
      maxPages: 50,
      followLinks: true,
      ignoreSitemap: false,
    });
    console.log("Crawl result:", crawlResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartCrawlJobParams

  # Load environment variables from .env file
  load_dotenv()

  # Initialize Hyperbrowser client
  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  def main():
      # Start crawling and wait for completion
      crawl_result = client.crawl.start_and_wait(
          StartCrawlJobParams(
              url="https://example.com",
              max_pages=50,
              follow_links=True,
              ignore_sitemap=False,
          )
      )
      print("Crawl result:\n", crawl_result.model_dump_json(indent=2))


  main()
  ```
</CodeGroup>
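
When building request bodies by hand (for example for the cURL-style REST calls), it can help to apply the documented defaults and check the `maxPages` limit before sending. This is a minimal sketch under stated assumptions: `build_crawl_payload` is a hypothetical helper, not part of the SDK, and the lower bound of 1 page is an assumption; only the default of 10 and the maximum of 100 come from the option list above.

```python
DEFAULT_MAX_PAGES = 10   # documented default
MAX_PAGES_LIMIT = 100    # documented maximum


def build_crawl_payload(
    url: str,
    max_pages: int = DEFAULT_MAX_PAGES,
    follow_links: bool = True,
    ignore_sitemap: bool = False,
) -> dict:
    """Return a JSON-ready body for POST /crawl, rejecting out-of-range maxPages."""
    # Assumption: at least one page must be requested.
    if not 1 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"maxPages must be between 1 and {MAX_PAGES_LIMIT}, got {max_pages}")
    return {
        "url": url,
        "maxPages": max_pages,
        "followLinks": follow_links,
        "ignoreSitemap": ignore_sitemap,
    }
```

Failing early on the client side avoids a round trip for a request the API would presumably reject anyway.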

## Session Configurations

You can also configure the session used to execute the crawl job, for example to use a proxy or solve CAPTCHAs. To see all available session parameters, check out the [API Reference](/api-reference/start-a-crawl-job#body-session-options) or [Session Parameters](/sessions/parameters).

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const crawlResult = await client.crawl.startAndWait({
      url: "https://example.com",
      maxPages: 10,
      followLinks: true,
      sessionOptions: {
        useProxy: true,
        solveCaptchas: true,
        proxyCountry: "US",
      },
    });
    console.log("Crawl result:", crawlResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartCrawlJobParams, CreateSessionParams

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  crawl_result = client.crawl.start_and_wait(
      StartCrawlJobParams(
          url="https://example.com",
          max_pages=10,
          follow_links=True,
          session_options=CreateSessionParams(use_proxy=True, solve_captchas=True),
      )
  )
  print("Crawl result:", crawl_result)
  ```
</CodeGroup>

<Warning>
  Using a proxy and solving CAPTCHAs slows down the crawl, so enable them only
  when necessary.
</Warning>

## Scrape Configurations

You can also provide optional scrape options for the crawl job, such as the formats to return, whether to return only the main content of each page, and the maximum timeout for navigating to a page.

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const crawlResult = await client.crawl.startAndWait({
      url: "https://example.com",
      scrapeOptions: {
        formats: ["markdown", "html", "links"],
        onlyMainContent: false,
        timeout: 10000,
      },
    });
    console.log("Crawl result:", crawlResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import ScrapeOptions, StartCrawlJobParams

  # Load environment variables from .env file
  load_dotenv()

  # Initialize Hyperbrowser client
  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  # Start crawling and wait for completion
  crawl_result = client.crawl.start_and_wait(
      StartCrawlJobParams(
          url="https://example.com",
          scrape_options=ScrapeOptions(
              formats=["html", "links", "markdown"], only_main_content=False, timeout=10000
          ),
      )
  )
  print("Crawl result:", crawl_result)
  ```
</CodeGroup>
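
When `markdown` is among the requested formats, each page in `data` carries a `markdown` field alongside its `metadata`, as in the sample response in the Response section. The sketch below builds a simple URL-to-markdown lookup from the raw JSON response. `markdown_by_url` is a hypothetical helper name, and the code assumes the response has been parsed into a plain dictionary.

```python
from typing import Any


def markdown_by_url(job: dict[str, Any]) -> dict[str, str]:
    """Map each crawled page's URL to its markdown, skipping pages missing either."""
    index: dict[str, str] = {}
    for page in job.get("data") or []:
        url = (page.get("metadata") or {}).get("url")
        markdown = page.get("markdown")
        if url and markdown:
            index[url] = markdown
    return index


# Mirrors the sample response from the Response section.
sample_job = {
    "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
    "status": "completed",
    "totalCrawledPages": 10,
    "data": [
        {
            "metadata": {
                "title": "Example Page",
                "description": "A sample webpage",
                "url": "https://example.com",
            },
            "markdown": "# Example Page\nThis is content...",
        }
    ],
}
pages = markdown_by_url(sample_job)
```

Pages that failed or that lack a `markdown` field are skipped, so the lookup only contains usable content.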

<Info>
  Hyperbrowser's CAPTCHA solving and proxy features require a `PAID` plan.
</Info>

For a full reference on the crawl endpoint, check out the [API Reference](/api-reference/start-a-crawl-job).
