# Advanced Guide

> End-to-end guide to scrape, crawl, and extract structured data

This guide shows how to use Hyperbrowser to scrape a single page, crawl multiple pages, and extract structured data. It also documents the most important parameters.

<Info>
  You can also see dedicated pages for [Scrape](/web-scraping/scrape), [Crawl](/web-scraping/crawl), and [Extract](/web-scraping/extract).
  For session configuration details, see [Configuration Parameters](/sessions/parameters).
  For full schemas, see the API Reference.
</Info>

## Scraping a web page

With just a URL, you can extract page contents in your chosen formats using the `/scrape` endpoint.

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    // Handles both starting and waiting for scrape job response
    const scrapeResult = await client.scrape.startAndWait({
      url: "https://example.com",
    });
    console.log("Scrape result:", scrapeResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartScrapeJobParams

  load_dotenv()

  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  # Start scraping and wait for completion
  scrape_result = client.scrape.start_and_wait(
      StartScrapeJobParams(url="https://example.com")
  )
  print("Scrape result:\n", scrape_result.model_dump_json(indent=2))
  ```

  ```bash cURL theme={null}
  # Start Scrape Job
  curl -X POST https://api.hyperbrowser.ai/api/scrape \
      -H 'Content-Type: application/json' \
      -H 'x-api-key: <YOUR_API_KEY>' \
      -d '{
          "url": "https://example.com"
      }'

  # Get Scrape Job Status
  curl https://api.hyperbrowser.ai/api/scrape/{jobId}/status \
      -H 'x-api-key: <YOUR_API_KEY>'

  # Get Scrape Job Status and Data
  curl https://api.hyperbrowser.ai/api/scrape/{jobId} \
      -H 'x-api-key: <YOUR_API_KEY>'
  ```
</CodeGroup>
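
The cURL flow above starts a job and then checks its status, which is roughly what `startAndWait` / `start_and_wait` handle for you. If you call the REST endpoints directly, you need a polling loop of your own. The sketch below shows one way to structure it. The names `wait_for_job` and `fetch_status`, the status strings (`"pending"`, `"running"`, `"completed"`, `"failed"`), and the interval values are all assumptions for illustration, not the SDK's implementation; check the API Reference for the actual status values.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_interval_s: float = 2.0,
    max_wait_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns a terminal status or time runs out.

    fetch_status is a hypothetical callable that wraps
    GET /api/scrape/{jobId}/status and returns the job's status string.
    The terminal status names below are assumptions.
    """
    terminal = {"completed", "failed"}
    waited = 0.0
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if waited >= max_wait_s:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(poll_interval_s)
        waited += poll_interval_s


# Simulated status sequence standing in for real HTTP calls.
_responses = iter(["pending", "running", "completed"])
final_status = wait_for_job(lambda: next(_responses), sleep=lambda _s: None)
print(final_status)
```

Passing the status lookup in as a callable keeps the loop independent of any HTTP client, so the same function could poll scrape, crawl, or extract jobs.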

### Session Options

All scraping APIs (scrape, crawl, and extract) accept session parameters. See [Session Parameters](/sessions/parameters) for the full list of options.

### Scrape Options

<ParamField path="formats" type="string[]" default="[&#x22;markdown&#x22;]">
  Output formats to include in the response. One or more of: `"html"`, `"links"`, `"markdown"`, `"screenshot"`.
</ParamField>

<ParamField path="includeTags" type="string[]">
  CSS selectors (tags, classes, IDs) to explicitly include. Only matching elements are returned.
</ParamField>

<ParamField path="excludeTags" type="string[]">
  CSS selectors (tags, classes, IDs) to exclude from the scraped content.
</ParamField>

<ParamField path="onlyMainContent" type="boolean" default="true">
  When `true`, attempts to extract only main content (omits headers/nav/footers).
</ParamField>

<ParamField path="waitFor" type="number" default="0">
  Milliseconds to wait after initial load before scraping (useful for dynamic content and CAPTCHA detection when `sessionOptions.solveCaptchas` is enabled).
</ParamField>

<ParamField path="timeout" type="number" default="30000">
  Maximum time in milliseconds to wait for navigation to complete. With the default `waitUntil`, this is equivalent to `page.goto(url, { waitUntil: "load", timeout })`.
</ParamField>

<ParamField path="waitUntil" type="string" default="load">
  Load condition: `"load"`, `"domcontentloaded"`, or `"networkidle"`.
</ParamField>

<ParamField path="screenshotOptions" type="object">
  Screenshot settings (effective only when `formats` includes `"screenshot"`). `fullPage` and `cropToContent` cannot both be `true`.

  * `fullPage` (`boolean`, default `false`) — capture full page beyond viewport
  * `format` (`"webp" | "jpeg" | "png"`, default `"webp"`)
  * `cropToContent` (`boolean`, default `false`) — Automatically adjusts the screenshot height to match the page's actual content. If the page is shorter than the viewport, the screenshot is trimmed to remove any empty space below the content. If the page is taller than the viewport, the screenshot is cropped to the height of the viewport.
  * `cropToContentMaxHeight` (`number`, optional) — The maximum height of the screenshot when `cropToContent` is true. Overrides the height set in the `screen` configuration.
  * `cropToContentMinHeight` (`number`, optional) — The minimum height of the screenshot when `cropToContent` is true. Overrides the height set in the `screen` configuration.
</ParamField>

<ParamField path="storageState" type="object">
  Set the storage state of the page before scraping.
  Properties:

  * `localStorage` (`object`, optional) — Local storage data (key-value pairs where both keys and values must be strings)
  * `sessionStorage` (`object`, optional) — Session storage data (key-value pairs where both keys and values must be strings)
</ParamField>
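
Two of the options above carry constraints that are easy to miss: `fullPage` and `cropToContent` are mutually exclusive, and `storageState` keys and values must be strings. The sketch below assembles a `scrapeOptions` object and checks both before a request is sent. `build_scrape_options` is a hypothetical helper, not part of the SDK, and serializing non-string storage values with `json.dumps` is an assumption about what the target site expects to find in `localStorage`.

```python
import json
from typing import Any, Optional


def build_scrape_options(
    formats: list[str],
    full_page: bool = False,
    crop_to_content: bool = False,
    local_storage: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble a scrapeOptions object, enforcing the documented constraints."""
    if full_page and crop_to_content:
        raise ValueError("fullPage and cropToContent cannot both be true")

    options: dict[str, Any] = {"formats": formats}

    # screenshotOptions only takes effect when "screenshot" is requested.
    if "screenshot" in formats:
        options["screenshotOptions"] = {
            "fullPage": full_page,
            "cropToContent": crop_to_content,
        }

    # Storage keys and values must be strings, so serialize anything else.
    if local_storage is not None:
        options["storageState"] = {
            "localStorage": {
                str(k): v if isinstance(v, str) else json.dumps(v)
                for k, v in local_storage.items()
            }
        }
    return options


options = build_scrape_options(
    formats=["markdown", "screenshot"],
    full_page=True,
    local_storage={"theme": "dark", "cart": {"items": 2}},
)
print(json.dumps(options, indent=2))
```

The resulting dictionary uses the camelCase field names from this page, so it can be dropped into the `scrapeOptions` field of a JSON request body.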

#### Example with options

These options control the format and content of the scraped data, as well as the behavior of the scraper itself.

For example, the request below scrapes a page with these settings:

* Stealth mode enabled
* Cookies accepted automatically
* Only the main content returned, as HTML
* All `<span>` elements excluded
* A 2-second wait after the page loads, before scraping

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const scrapeResult = await client.scrape.startAndWait({
      url: "https://example.com",
      sessionOptions: {
        useStealth: true,
        acceptCookies: true,
      },
      scrapeOptions: {
        formats: ["html"],
        onlyMainContent: true,
        excludeTags: ["span"],
        waitFor: 2000,
      },
    });
    console.log("Scrape result:", scrapeResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartScrapeJobParams, CreateSessionParams, ScrapeOptions


  load_dotenv()


  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  scrape_result = client.scrape.start_and_wait(
      StartScrapeJobParams(
          url="https://example.com",
          session_options=CreateSessionParams(use_stealth=True, accept_cookies=True),
          scrape_options=ScrapeOptions(
              formats=["html"],
              only_main_content=True,
              exclude_tags=["span"],
              wait_for=2000,
          ),
      )
  )

  print("Scrape result:\n", scrape_result.model_dump_json(indent=2))
  ```

  ```bash cURL theme={null}
  curl -X POST https://api.hyperbrowser.ai/api/scrape \
      -H 'Content-Type: application/json' \
      -H 'x-api-key: <YOUR_API_KEY>' \
      -d '{
              "url": "https://example.com",
              "sessionOptions": {
                      "useStealth": true,
                      "acceptCookies": true
              },
              "scrapeOptions": {
                      "formats": ["html"],
                      "onlyMainContent": true,
                      "excludeTags": ["span"],
                      "waitFor": 2000
              }
      }'
  ```
</CodeGroup>

## Crawl a site

Instead of scraping a single page, you can collect content across multiple pages using the `/crawl` endpoint. It accepts the same `sessionOptions` and `scrapeOptions` as `/scrape`, along with the crawl-specific options below.

### Crawl Options

<ParamField path="url" type="string" required>
  The URL where the crawl starts.
</ParamField>

<ParamField path="maxPages" type="number">
  Maximum number of pages to crawl before stopping (minimum: 1).
</ParamField>

<ParamField path="followLinks" type="boolean" default="true">
  When `true`, follow links discovered on pages to expand the crawl.
</ParamField>

<ParamField path="ignoreSitemap" type="boolean" default="false">
  When `true`, skip pre-generating URLs from sitemaps at the target origin.
</ParamField>

<ParamField path="excludePatterns" type="string[]">
  Regex or wildcard patterns for URL paths to exclude from the crawl.
</ParamField>

<ParamField path="includePatterns" type="string[]">
  Regex or wildcard patterns for URL paths to include (only matching pages will be crawled).
</ParamField>

<ParamField path="sessionOptions" type="object">
  Session configuration used during the crawl. See [Session Parameters](/sessions/parameters).
</ParamField>

<ParamField path="scrapeOptions" type="object">
  Scrape options used during the crawl. See [Scrape Options](#scrape-options).
</ParamField>

#### Example with options

<CodeGroup>
  ```typescript Node theme={null}
  import { Hyperbrowser } from "@hyperbrowser/sdk";
  import { config } from "dotenv";

  config();

  const client = new Hyperbrowser({
    apiKey: process.env.HYPERBROWSER_API_KEY,
  });

  const main = async () => {
    const crawlResult = await client.crawl.startAndWait({
      url: "https://hyperbrowser.ai",
      maxPages: 5,
      includePatterns: ["/blog/*"],
      scrapeOptions: {
        formats: ["markdown"],
        onlyMainContent: true,
        excludeTags: ["span"],
      },
    });
    console.log("Crawl result:", crawlResult);
  };

  main();
  ```

  ```python Python theme={null}
  import os
  from dotenv import load_dotenv
  from hyperbrowser import Hyperbrowser
  from hyperbrowser.models import StartCrawlJobParams, ScrapeOptions


  load_dotenv()


  client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


  crawl_result = client.crawl.start_and_wait(
      StartCrawlJobParams(
          url="https://hyperbrowser.ai",
          max_pages=5,
          include_patterns=["/blog/*"],
          scrape_options=ScrapeOptions(
              formats=["markdown"],
              only_main_content=True,
              exclude_tags=["span"],
          ),
      )
  )

  print("Crawl result:\n", crawl_result.model_dump_json(indent=2))
  ```

  ```bash cURL theme={null}
  curl -X POST https://api.hyperbrowser.ai/api/crawl \
      -H 'Content-Type: application/json' \
      -H 'x-api-key: <YOUR_API_KEY>' \
      -d '{
              "url": "https://hyperbrowser.ai",
              "maxPages": 5,
              "includePatterns": ["/blog/*"],
              "scrapeOptions": {
                      "formats": ["markdown"],
                      "onlyMainContent": true,
                      "excludeTags": ["span"]
              }
      }'
  ```
</CodeGroup>

## Structured extraction

The Extract API fetches data in a well-defined structure from any set of pages. Provide a list of URLs, and Hyperbrowser will collect relevant content (including optional crawling) and return data that fits your schema or prompt.

### Extract Options

<ParamField path="urls" type="string[]" required>
  List of page URLs. To crawl beyond a single page, append `/*` to a URL (e.g., `https://example.com/*`); Hyperbrowser then follows relevant links, up to `maxLinks`.
</ParamField>

<ParamField path="schema" type="object">
  JSON Schema for the desired output.
</ParamField>

<ParamField path="prompt" type="string">
  Instructional prompt describing how to structure the extracted data. If no `schema` is provided, Hyperbrowser tries to generate one from the prompt.
</ParamField>

<ParamField path="systemPrompt" type="string">
  Additional instructions to guide extraction behavior.
</ParamField>

<ParamField path="maxLinks" type="number">
  Maximum number of links to follow for each URL that ends in `/*`.
</ParamField>

<ParamField path="waitFor" type="number" default="0">
  Milliseconds to wait after page load before extraction (useful for dynamic content and CAPTCHA detection when `sessionOptions.solveCaptchas` is enabled).
</ParamField>

<ParamField path="sessionOptions" type="object">
  Session configuration used during extraction. See [Session Parameters](/sessions/parameters).
</ParamField>

<Info>
  You can provide a **schema**, a **prompt**, or both; providing both gives the best results. The **schema** should define exactly how the extracted data is formatted, and the **prompt** should include any information that helps guide the extraction. If no **schema** is provided, Hyperbrowser tries to generate one from the **prompt**.
</Info>
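
To make the options above concrete, here is a minimal sketch that builds the JSON body for an extract request with both a schema and a prompt. `build_extract_request` is a hypothetical helper, and the blog-post schema, the prompt text, and the `maxLinks` value are illustrative only. The check that at least one of `schema` or `prompt` is present is an assumption drawn from the note above, not confirmed API behavior.

```python
import json
from typing import Any, Optional


def build_extract_request(
    urls: list[str],
    schema: Optional[dict[str, Any]] = None,
    prompt: Optional[str] = None,
    max_links: Optional[int] = None,
) -> dict[str, Any]:
    """Build an extract request body using the field names documented above."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if schema is None and prompt is None:
        raise ValueError("provide a schema, a prompt, or both")

    body: dict[str, Any] = {"urls": urls}
    if schema is not None:
        body["schema"] = schema
    if prompt is not None:
        body["prompt"] = prompt
    if max_links is not None:
        body["maxLinks"] = max_links
    return body


# Hypothetical JSON Schema: a list of blog posts with a title and publish date.
post_schema = {
    "type": "object",
    "properties": {
        "posts": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "publishedAt": {"type": "string"},
                },
                "required": ["title"],
            },
        }
    },
    "required": ["posts"],
}

request_body = build_extract_request(
    urls=["https://example.com/*"],  # trailing /* asks for links to be followed
    schema=post_schema,
    prompt="List every blog post with its title and publish date.",
    max_links=10,
)
print(json.dumps(request_body, indent=2))
```

Here the schema fixes the output shape while the prompt tells the extractor what to look for, which is the combination the note above recommends.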
