# Scrape

> Scrape any page and get formatted data

The Scrape API lets you extract data from a web page with a single call. You can capture page content in several formats, such as Markdown, HTML, or a list of links.

<Info>
  For detailed usage, check out the [Scrape API Reference](/docs/api-reference/create-new-scrape-job).
</Info>

Hyperbrowser exposes endpoints for starting a scrape job and for getting its status and results. Scraping is asynchronous by default: you start a job, then check its status until it completes. The SDKs provide a single function that handles the whole flow and returns the data once the job is completed.

## Installation

<CodeGroup>
  ```bash npm theme={null}
  npm install @hyperbrowser/sdk dotenv
  ```

  ```bash yarn theme={null}
  yarn add @hyperbrowser/sdk dotenv
  ```

  ```bash pip theme={null}
  pip install hyperbrowser python-dotenv
  ```

  ```bash uv theme={null}
  uv add hyperbrowser python-dotenv
  ```
</CodeGroup>

## Usage

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    // Handles both starting and waiting for scrape job response
    const scrapeResult = await client.scrape.startAndWait({
      url: "https://example.com",
    });
    console.log("Scrape result:", scrapeResult);
  };

  main();
  ```

  ```python Python 1.0+ theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser

  # Load environment variables from .env file
  load_dotenv()

  # Initialize Hyperbrowser client
  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  def main():
      # Start scraping and wait for completion
      scrape_result = client.scrape.start_and_wait({"url": "https://example.com"})
      print("Scrape result:\n", scrape_result.model_dump_json(indent=2))


  main()
  ```

  ```python Python (legacy) theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartScrapeJobParams

  # Load environment variables from .env file
  load_dotenv()

  # Initialize Hyperbrowser client
  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  def main():
      # Start scraping and wait for completion
      scrape_result = client.scrape.start_and_wait(
          StartScrapeJobParams(url="https://example.com")
      )
      print("Scrape result:\n", scrape_result.model_dump_json(indent=2))


  main()
  ```

  ```bash cURL theme={null}
  # Start Scrape Job
  curl -X POST https://api.hyperbrowser.ai/api/scrape \
      -H 'Content-Type: application/json' \
      -H 'x-api-key: <YOUR_API_KEY>' \
      -d '{
          "url": "https://example.com"
      }'

  # Get Scrape Job Status
  curl https://api.hyperbrowser.ai/api/scrape/{jobId}/status \
      -H 'x-api-key: <YOUR_API_KEY>'

  # Get Scrape Job Status and Data
  curl https://api.hyperbrowser.ai/api/scrape/{jobId} \
      -H 'x-api-key: <YOUR_API_KEY>'
  ```
</CodeGroup>

## Response

The Start Scrape Job `POST /scrape` endpoint returns a `jobId` in the response, which you can use to get information about the job in subsequent requests.

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
```

The Get Scrape Job Status `GET /scrape/{jobId}/status` will return the following data:

```json theme={null}
{
  "status": "completed"
}
```

The Get Scrape Job `GET /scrape/{jobId}` will return the following data:

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": {
    "metadata": {
      "title": "Example Page",
      "description": "A sample webpage"
    },
    "markdown": "# Example Page\nThis is content..."
  }
}
```

The status of a scrape job is one of `pending`, `running`, `completed`, or `failed`. The response can also include optional fields: `error`, with a message if the job encountered an error, and `html` and `links` in the `data` object, depending on which formats were requested.

To see the full schema, check out the [API Reference](/docs/api-reference/create-new-scrape-job).
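If you call the REST endpoints directly instead of using the SDK helper, the start-then-check flow can be sketched as a small polling loop. This is a minimal sketch, not the SDK's implementation: `wait_for_job`, `get_status`, the poll interval, and the maximum wait are all assumptions made for illustration. `get_status` stands in for whatever function you write to call `GET /scrape/{jobId}/status`, and the demo feeds it canned responses so the example runs without network access.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which the job will not change again (taken from the list above).
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    get_status: Callable[[], Dict[str, Any]],
    poll_interval: float = 2.0,   # assumed value, not a documented default
    max_wait: float = 300.0,      # assumed value, not a documented default
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until the job reports a terminal status, then return it."""
    waited = 0.0
    while True:
        status = get_status()["status"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= max_wait:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(poll_interval)
        waited += poll_interval


# Canned responses standing in for successive GET /scrape/{jobId}/status calls.
_responses = iter(
    [{"status": "pending"}, {"status": "running"}, {"status": "completed"}]
)
final_status = wait_for_job(lambda: next(_responses), sleep=lambda _: None)
print(final_status)  # completed
```

Once the loop returns `completed`, a single call to `GET /scrape/{jobId}` retrieves the data. If it returns `failed`, the optional `error` field is the place to look for details.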

## Session Configurations

You can also configure the session used to run the scrape job, just as you would when creating a session directly. For example, you can use a proxy or solve CAPTCHAs. To see all available session parameters, check out the [API Reference](/docs/api-reference/create-new-scrape-job#body-session-options) or [Session Parameters](/docs/sessions/parameters).

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const scrapeResult = await client.scrape.startAndWait({
      url: "https://example.com",
      sessionOptions: {
        useProxy: true,
        solveCaptchas: true,
        proxyCountry: "US",
      },
    });
    console.log("Scrape result:", scrapeResult);
  };

  main();
  ```

  ```python Python 1.0+ theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  def main():
      scrape_result = client.scrape.start_and_wait(
          {
              "url": "https://example.com",
              "session_options": {"use_proxy": True, "solve_captchas": True},
          }
      )
      print("Scrape result:\n", scrape_result.model_dump_json(indent=2))


  main()
  ```

  ```python Python (legacy) theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartScrapeJobParams, CreateSessionParams

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  def main():
      scrape_result = client.scrape.start_and_wait(
          StartScrapeJobParams(
              url="https://example.com",
              session_options=CreateSessionParams(use_proxy=True, solve_captchas=True),
          )
      )
      print("Scrape result:\n", scrape_result.model_dump_json(indent=2))


  main()
  ```
</CodeGroup>

<Warning>
  Proxy usage and CAPTCHA solving are only available on `PAID` plans.

  Using a proxy and solving CAPTCHAs slows down the scrape, so enable them only when necessary.
</Warning>
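One way to follow the advice above is to try a plain scrape first and turn on the slower session options only if that attempt does not complete. The sketch below illustrates the idea under stated assumptions: `scrape_with_fallback` and `fake_scrape` are hypothetical names, the `scrape` callable stands in for your own wrapper around the SDK or REST API, and results are modelled as plain dictionaries with a `status` key. How a failed job surfaces in the SDK (a returned status or a raised exception) is not covered here, so adapt the check accordingly.

```python
from typing import Any, Callable, Dict, List

Params = Dict[str, Any]
Result = Dict[str, Any]


def scrape_with_fallback(scrape: Callable[[Params], Result], url: str) -> Result:
    """Scrape without session options first; retry with proxy and CAPTCHA solving."""
    result = scrape({"url": url})
    if result.get("status") == "completed":
        return result
    # Second attempt: same URL, now with the slower session options enabled.
    return scrape(
        {
            "url": url,
            "session_options": {"use_proxy": True, "solve_captchas": True},
        }
    )


# Hypothetical stand-in: it only succeeds when session options are supplied.
calls: List[Params] = []


def fake_scrape(params: Params) -> Result:
    calls.append(params)
    ok = "session_options" in params
    return {"status": "completed" if ok else "failed"}


result = scrape_with_fallback(fake_scrape, "https://example.com")
print(result["status"], len(calls))  # completed 2
```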

## Scrape Configurations

You can also provide optional parameters for the scrape job itself, such as the formats to return, whether to return only the main content of the page, and the maximum timeout for navigating to a page.

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const scrapeResult = await client.scrape.startAndWait({
      url: "https://example.com",
      scrapeOptions: {
        formats: ["markdown", "html", "links"],
        onlyMainContent: false,
        timeout: 15000,
      },
    });
    console.log("Scrape result:", scrapeResult);
  };

  main();
  ```

  ```python Python 1.0+ theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser

  # Load environment variables from .env file
  load_dotenv()

  # Initialize Hyperbrowser client
  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  # Start scraping and wait for completion
  scrape_result = client.scrape.start_and_wait(
      {
          "url": "https://example.com",
          "scrape_options": {
              "formats": ["html", "links", "markdown"],
              "only_main_content": False,
              "timeout": 5000,
          },
      }
  )
  print("Scrape result:", scrape_result)
  ```

  ```python Python (legacy) theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import ScrapeOptions, StartScrapeJobParams

  # Load environment variables from .env file
  load_dotenv()

  # Initialize Hyperbrowser client
  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  # Start scraping and wait for completion
  scrape_result = client.scrape.start_and_wait(
      StartScrapeJobParams(
          url="https://example.com",
          scrape_options=ScrapeOptions(
              formats=["html", "links", "markdown"], only_main_content=False, timeout=5000
          ),
      )
  )
  print("Scrape result:", scrape_result)
  ```
</CodeGroup>

For a full reference on the scrape endpoint, check out the [API Reference](/docs/api-reference/create-new-scrape-job).

## Batch Scrape

Batch Scrape works the same as a regular scrape, except that instead of a single URL you provide a list of up to 1,000 URLs to scrape at once.

<Warning>Batch Scrape is currently only available on the `Scale` plan or higher.</Warning>
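If you have more than 1,000 URLs, one option is to split the list into chunks and start one batch job per chunk. The helper below is a minimal sketch of that split. `chunk_urls` and `MAX_BATCH_URLS` are hypothetical names introduced for illustration, and the example URLs are placeholders.

```python
from typing import Iterator, List

MAX_BATCH_URLS = 1000  # per-job limit stated above


def chunk_urls(urls: List[str], size: int = MAX_BATCH_URLS) -> Iterator[List[str]]:
    """Yield consecutive slices of `urls`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


# Placeholder URLs; each resulting chunk would be passed as `urls` to one batch job.
urls = [f"https://example.com/page/{i}" for i in range(2500)]
batches = list(chunk_urls(urls))
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```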

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const scrapeResult = await client.scrape.batch.startAndWait({
      urls: ["https://example.com", "https://hyperbrowser.ai"],
      scrapeOptions: {
        formats: ["markdown", "html", "links"],
      },
    });
    console.log("Scrape result:", scrapeResult);
  };

  main();
  ```

  ```python Python 1.0+ theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  scrape_result = client.scrape.batch.start_and_wait(
      {
          "urls": ["https://example.com", "https://hyperbrowser.ai"],
          "scrape_options": {"formats": ["html", "links", "markdown"]},
      }
  )
  print("Scrape result:", scrape_result)
  ```

  ```python Python (legacy) theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import ScrapeOptions, StartBatchScrapeJobParams

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  scrape_result = client.scrape.batch.start_and_wait(
      StartBatchScrapeJobParams(
          urls=["https://example.com", "https://hyperbrowser.ai"],
          scrape_options=ScrapeOptions(formats=["html", "links", "markdown"]),
      )
  )
  print("Scrape result:", scrape_result)
  ```
</CodeGroup>

### Response

The Start Batch Scrape Job `POST /scrape/batch` endpoint will return a `jobId` in the response which can be used to get information about the job in subsequent requests.

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
```

The Get Batch Scrape Job Status `GET /scrape/batch/{jobId}/status` will return the following data:

```json theme={null}
{
  "status": "completed"
}
```

The Get Batch Scrape Job `GET /scrape/batch/{jobId}` will return the following data:

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "totalScrapedPages": 2,
  "totalPageBatches": 1,
  "currentPageBatch": 1,
  "batchSize": 20,
  "data": [
    {
      "markdown": "Hyperbrowser\n\n[Home](https://hyperbrowser.ai/)...",
      "metadata": {
        "url": "https://www.hyperbrowser.ai/",
        "title": "Hyperbrowser",
        "viewport": "width=device-width, initial-scale=1",
        "link:icon": "https://www.hyperbrowser.ai/favicon.ico",
        "sourceURL": "https://hyperbrowser.ai",
        "description": "Infinite Browsers"
      },
      "url": "hyperbrowser.ai",
      "status": "completed",
      "error": null
    },
    {
      "markdown": "Example Domain\n\n# Example Domain...",
      "metadata": {
        "url": "https://www.example.com/",
        "title": "Example Domain",
        "viewport": "width=device-width, initial-scale=1",
        "sourceURL": "https://example.com"
      },
      "url": "example.com",
      "status": "completed",
      "error": null
    }
  ]
}
```

<Info>
  Hyperbrowser's CAPTCHA solving and proxy usage features require being on a `PAID` plan.
</Info>

The status of a batch scrape job is one of `pending`, `running`, `completed`, or `failed`. The results of all the scrapes are returned as an array in the `data` field of the response. Pages are returned in the same order as the URLs you provided, and each one has its own status and information.

To see the full schema, check out the [API Reference](/docs/api-reference/start-a-batch-scrape-job).
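Because each page carries its own `status` and `error`, it is usually worth separating the pages that succeeded from those that did not before processing the results. The sketch below works on a plain dictionary shaped like the JSON response above. `split_results` is a hypothetical helper, and the failed entry in the sample (including its error message) is invented for illustration. The per-page status values are not listed on this page, so the sketch treats anything other than `completed` as unsuccessful.

```python
from typing import Any, Dict, List, Tuple

Page = Dict[str, Any]


def split_results(job: Dict[str, Any]) -> Tuple[List[Page], List[Page]]:
    """Return (completed_pages, other_pages) from a batch scrape job response."""
    completed: List[Page] = []
    others: List[Page] = []
    for page in job.get("data") or []:
        if page.get("status") == "completed":
            completed.append(page)
        else:
            others.append(page)
    return completed, others


# Sample shaped like the response above; the second entry is a made-up failure.
sample_job = {
    "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
    "status": "completed",
    "data": [
        {
            "url": "example.com",
            "status": "completed",
            "error": None,
            "markdown": "Example Domain\n\n# Example Domain...",
        },
        {
            "url": "hyperbrowser.ai",
            "status": "failed",
            "error": "hypothetical error message",
        },
    ],
}

ok_pages, bad_pages = split_results(sample_job)
print([p["url"] for p in ok_pages])                 # ['example.com']
print([(p["url"], p["error"]) for p in bad_pages])  # [('hyperbrowser.ai', 'hypothetical error message')]
```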

As with a single scrape, batch scraping is asynchronous by default: you start the job, then check its status until it completes. The SDKs provide a single function (`client.scrape.batch.startAndWait` in Node, `client.scrape.batch.start_and_wait` in Python) that handles the whole flow and returns the data once the job is completed.