
# Extract

> Extract structured data from web pages using AI

The Extract API turns web pages into structured data. You provide a schema and a prompt, and Hyperbrowser returns data that matches them.

<Info>
  For detailed usage, check out the [Extract API Reference](/api-reference/start-an-extract-job).
</Info>

## Installation

<CodeGroup>
  ```bash npm theme={null}
  npm install @hyperbrowser/sdk dotenv
  ```

  ```bash yarn theme={null}
  yarn add @hyperbrowser/sdk dotenv
  ```

  ```bash pip theme={null}
  pip install hyperbrowser python-dotenv
  ```

  ```bash uv theme={null}
  uv add hyperbrowser python-dotenv
  ```
</CodeGroup>

## Usage

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const extractResult = await client.extract.startAndWait({
      urls: ["https://example.com"],
      prompt: "Extract the main heading and description from the page",
      schema: {
        type: "object",
        properties: {
          heading: { type: "string" },
          description: { type: "string" },
        },
        required: ["heading", "description"],
      },
    });
    console.log("Extract result:", extractResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartExtractJobParams

  # Load environment variables from .env file
  load_dotenv()

  # Initialize Hyperbrowser client
  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  def main():
      # Start extraction and wait for completion
      extract_result = client.extract.start_and_wait(
          StartExtractJobParams(
              urls=["https://example.com"],
              prompt="Extract the main heading and description from the page",
              schema={
                  "type": "object",
                  "properties": {
                      "heading": {"type": "string"},
                      "description": {"type": "string"}
                  },
                  "required": ["heading", "description"]
              }
          )
      )
      print("Extract result:\n", extract_result.model_dump_json(indent=2))


  main()
  ```

  ```bash cURL theme={null}
  # Start Extract Job
  curl -X POST https://api.hyperbrowser.ai/api/extract \
      -H 'Content-Type: application/json' \
      -H 'x-api-key: <YOUR_API_KEY>' \
      -d '{
          "urls": ["https://example.com"],
          "prompt": "Extract the main heading and description from the page",
          "schema": {
              "type": "object",
              "properties": {
                  "heading": {"type": "string"},
                  "description": {"type": "string"}
              },
              "required": ["heading", "description"]
          }
      }'

  # Get Extract Job Status
  curl https://api.hyperbrowser.ai/api/extract/{jobId}/status \
      -H 'x-api-key: <YOUR_API_KEY>'

  # Get Extract Job Status and Data
  curl https://api.hyperbrowser.ai/api/extract/{jobId} \
      -H 'x-api-key: <YOUR_API_KEY>'
  ```
</CodeGroup>

## Response

The Start Extract Job `POST /extract` endpoint returns a `jobId`, which you use in subsequent requests to get information about the job.

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
```

The Get Extract Job Status `GET /extract/{jobId}/status` endpoint returns the job's current status:

```json theme={null}
{
  "status": "completed"
}
```

The Get Extract Job `GET /extract/{jobId}` endpoint returns the status together with the extracted data:

```json theme={null}
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": {
    "heading": "Example Domain",
    "description": "This domain is for use in documentation examples without needing permission. Avoid use in operations."
  }
}
```

The status of an extract job is one of `pending`, `running`, `completed`, or `failed`.

To see the full schema, check out the [API Reference](/api-reference/start-an-extract-job).
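
If you call the REST endpoints directly instead of using the SDK's `start_and_wait`, the status values above can drive a simple polling loop. The following is a minimal sketch, not the SDK's implementation: `wait_for_job`, `fetch_status`, the two-second interval, and the five-minute timeout are illustrative assumptions, while the base URL and the `x-api-key` header come from the cURL example.

```python
import json
import time
import urllib.request

API_BASE = "https://api.hyperbrowser.ai/api"  # base URL from the cURL example
TERMINAL_STATUSES = {"completed", "failed"}


def fetch_status(job_id: str, api_key: str) -> str:
    """Call GET /extract/{jobId}/status and return the status string."""
    request = urllib.request.Request(
        f"{API_BASE}/extract/{job_id}/status",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["status"]


def wait_for_job(get_status, interval_s: float = 2.0, timeout_s: float = 300.0) -> str:
    """Poll get_status() until the job is completed or failed.

    get_status is any zero-argument callable returning a status string
    (an assumption of this sketch), so the loop can be tested offline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} seconds")
        time.sleep(interval_s)
```

A call such as `wait_for_job(lambda: fetch_status(job_id, api_key))` returns once the job reaches a terminal status, after which `GET /extract/{jobId}` provides the data.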

## Schema Definition

You can define a JSON schema to specify the structure of the data you want to extract. The schema should follow the JSON Schema specification.

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const extractResult = await client.extract.startAndWait({
      urls: ["https://news.ycombinator.com"],
      prompt: "Extract all article titles and their URLs from the front page",
      schema: {
        type: "object",
        properties: {
          articles: {
            type: "array",
            items: {
              type: "object",
              properties: {
                title: { type: "string" },
                url: { type: "string" },
                score: { type: "number" },
              },
              required: ["title", "url"],
            },
          },
        },
        required: ["articles"],
      },
    });
    console.log("Extract result:", extractResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartExtractJobParams

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  def main():
      extract_result = client.extract.start_and_wait(
          StartExtractJobParams(
              urls=["https://news.ycombinator.com"],
              prompt="Extract all article titles and their URLs from the front page",
              schema={
                  "type": "object",
                  "properties": {
                      "articles": {
                          "type": "array",
                          "items": {
                              "type": "object",
                              "properties": {
                                  "title": {"type": "string"},
                                  "url": {"type": "string"},
                                  "score": {"type": "number"}
                              },
                              "required": ["title", "url"]
                          }
                      }
                  },
                  "required": ["articles"]
              }
          )
      )
      print("Extract result:\n", extract_result.model_dump_json(indent=2))


  main()
  ```
</CodeGroup>

<Tip>
  For best results, provide both a schema and a prompt. The schema should define exactly how you want the extracted data formatted, and the prompt should contain any information that helps guide the extraction. If you omit the schema, Hyperbrowser tries to generate one automatically from the prompt.
</Tip>
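
Before using the result, it can be worth checking the returned `data` against the schema you sent. The sketch below is a hypothetical helper, not part of the SDK: `missing_required` understands only the `type`, `properties`, `items`, and `required` keywords used in the examples above and reports required fields that are absent. For full JSON Schema validation, a dedicated validator library is a better fit.

```python
def missing_required(data, schema, path="$"):
    """Return paths of required fields that are missing from data.

    Handles only the object/array subset of JSON Schema used on this page.
    """
    missing = []
    if schema.get("type") == "object" and isinstance(data, dict):
        for name in schema.get("required", []):
            if name not in data:
                missing.append(f"{path}.{name}")
        # Recurse into the properties that are actually present.
        for name, subschema in schema.get("properties", {}).items():
            if name in data:
                missing.extend(missing_required(data[name], subschema, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(data, list):
        for index, item in enumerate(data):
            missing.extend(missing_required(item, schema.get("items", {}), f"{path}[{index}]"))
    return missing


# Same schema as the Hacker News example above.
schema = {
    "type": "object",
    "properties": {
        "articles": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "url": {"type": "string"},
                    "score": {"type": "number"},
                },
                "required": ["title", "url"],
            },
        }
    },
    "required": ["articles"],
}

# Made-up sample data: the second article lacks the required "url".
data = {"articles": [{"title": "Example", "url": "https://example.com"}, {"title": "No link"}]}
print(missing_required(data, schema))  # ['$.articles[1].url']
```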

## Session Configurations

You can also configure the session that executes the extract job, for example to use a proxy or solve CAPTCHAs. To see all available session parameters, check out the [API Reference](/api-reference/start-an-extract-job#body-session-options) or [Session Parameters](/sessions/parameters).

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const extractResult = await client.extract.startAndWait({
      urls: ["https://example.com"],
      prompt: "Extract the main heading and description",
      schema: {
        type: "object",
        properties: {
          heading: { type: "string" },
          description: { type: "string" },
        },
      },
      sessionOptions: {
        useProxy: true,
        solveCaptchas: true,
        proxyCountry: "US",
      },
    });
    console.log("Extract result:", extractResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartExtractJobParams, CreateSessionParams

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  extract_result = client.extract.start_and_wait(
      StartExtractJobParams(
          urls=["https://example.com"],
          prompt="Extract the main heading and description",
          schema={
              "type": "object",
              "properties": {
                  "heading": {"type": "string"},
                  "description": {"type": "string"}
              }
          },
          session_options=CreateSessionParams(use_proxy=True, solve_captchas=True),
      )
  )
  print("Extract result:", extract_result)
  ```
</CodeGroup>

<Info>
  Hyperbrowser's CAPTCHA solving and proxy features require a paid plan.
</Info>

<Info>
  Using a proxy and solving CAPTCHAs slow down page scraping in the extract job, so enable them only when necessary.
</Info>
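
One way to enable these options only when needed is to run the job without session options first and retry with them if it fails. The sketch below is an illustration under stated assumptions: `extract_with_fallback` and `run_extract` are hypothetical names, the result is assumed to be a dict with a `status` key shaped like the Get Extract Job response, and the `sessionOptions` keys mirror the Node example. Whether a blocked page actually surfaces as `failed` depends on the site, so treat the retry condition as a placeholder.

```python
def extract_with_fallback(run_extract, params):
    """Try a plain extract first; retry with proxy and CAPTCHA solving on failure.

    run_extract is a hypothetical callable that takes a params dict and
    returns a dict with a "status" key.
    """
    result = run_extract(params)
    if result.get("status") != "failed":
        return result
    # Slower path: only taken when the plain attempt failed.
    hardened = {**params, "sessionOptions": {"useProxy": True, "solveCaptchas": True}}
    return run_extract(hardened)
```

Wrapping the SDK call in a small function that accepts a params dict keeps the fallback logic independent of the client and easy to test with a stub.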

For a full reference on the extract endpoint, check out the [API Reference](/api-reference/start-an-extract-job).
