For the full API schema, see the Crawl API Reference.
## Quick Start
Crawling is asynchronous: start the job, then poll for results. The SDKs provide a `startAndWait` / `start_and_wait` convenience method that handles polling for you.

Starting a crawl job returns a `jobId`. The `status` field on the job can be `pending`, `running`, `completed`, or `failed`. Each page in `data` also has its own `status` and may include an `error` field if that page failed.
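The polling loop that the SDK convenience methods wrap can be sketched as follows. Here `get_job` stands in for whatever call retrieves the job's current state; the function name and the simulated job are illustrative, not part of the SDK:

```python
import time

def wait_for_crawl(get_job, poll_interval=2.0, timeout=120.0):
    """Poll a crawl job until it reaches a terminal status."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = get_job()
        # "completed" and "failed" are the terminal statuses.
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(poll_interval)
    raise TimeoutError("crawl job did not finish before the timeout")

# Simulated job that completes on the third poll:
_states = iter(["pending", "running", "completed"])
def fake_get_job():
    return {"jobId": "job_123", "status": next(_states)}

job = wait_for_crawl(fake_get_job, poll_interval=0.01)
```

In practice you would call `startAndWait` / `start_and_wait` instead of writing this loop yourself.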
## Crawl Options

Control how the crawler traverses the site with `crawlOptions`:
| Field | Type | Default | Description |
|---|---|---|---|
| `crawlOptions.maxPages` | number | `10` | Maximum number of pages to crawl (max: 100) |
| `crawlOptions.followLinks` | boolean | `true` | Whether to follow links found on crawled pages |
| `crawlOptions.ignoreSitemap` | boolean | `false` | Whether to ignore the site’s sitemap |
| `crawlOptions.includePatterns` | string[] | `[]` | URL patterns to include (only matching URLs are crawled) |
| `crawlOptions.excludePatterns` | string[] | `[]` | URL patterns to exclude from crawling |
### Example with URL filters
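A sketch of how include/exclude patterns interact when deciding whether a URL is crawled. The exact pattern syntax the crawler uses is not specified here, so glob-style matching via `fnmatch` is an assumption for illustration:

```python
from fnmatch import fnmatch

# Hypothetical crawl options restricting the crawl to the docs site,
# skipping changelog pages.
crawl_options = {
    "maxPages": 50,
    "includePatterns": ["https://docs.example.com/*"],
    "excludePatterns": ["*/changelog/*"],
}

def url_allowed(url, options):
    """Include patterns are applied first, then exclude patterns."""
    include = options.get("includePatterns") or []
    exclude = options.get("excludePatterns") or []
    if include and not any(fnmatch(url, p) for p in include):
        return False
    return not any(fnmatch(url, p) for p in exclude)
```

With these options, `https://docs.example.com/guide` is crawled, while `https://docs.example.com/changelog/v2` and anything off the docs host are skipped.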
## Outputs

Use `outputs.formats` to control what data is returned for each crawled page. This works the same as Fetch outputs: you can request markdown, HTML, links, screenshots, or structured JSON.
### Structured JSON extraction per page
Pass a JSON Schema to extract structured data from each crawled page. You can use a raw JSON Schema object, a Zod schema (Node), or a Pydantic model (Python).
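With a raw JSON Schema object, a request body might be assembled like this. The exact shape of the `formats` entry is an assumption here; consult the Crawl API Reference for the precise field names:

```python
# A plain JSON Schema describing the data to extract from each page.
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["title"],
}

# Sketch of a crawl request body carrying the schema; the "formats"
# entry shape is illustrative, not taken from the API reference.
request_body = {
    "url": "https://example.com/products",
    "outputs": {
        "formats": [{"type": "json", "schema": product_schema}],
    },
}
```

In Node you could pass a Zod schema in place of the raw object, and in Python a Pydantic model, as noted above.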
## Output Controls

Control what gets extracted from each crawled page. These work the same as Fetch output controls:

| Field | Type | Default | Description |
|---|---|---|---|
| `outputs.sanitize` | string | `"none"` | Sanitize mode: `"none"`, `"basic"`, or `"advanced"` |
| `outputs.includeSelectors` | string[] | `[]` | CSS selectors to include (only matching elements returned) |
| `outputs.excludeSelectors` | string[] | `[]` | CSS selectors to exclude from output |
| `outputs.storageState` | object | — | Pre-seed localStorage/sessionStorage before fetching |
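The fields above combine into a single `outputs` block. The values below, including the shape of `storageState`, are hypothetical and only illustrate how the options fit together:

```python
# Sketch of an outputs block combining format selection with output
# controls; field names follow the table above, values are illustrative.
outputs = {
    "formats": ["markdown"],
    "sanitize": "basic",
    "includeSelectors": ["main", "article"],
    "excludeSelectors": ["nav", "footer", ".ads"],
    "storageState": {
        # Hypothetical pre-seeded storage; the exact shape is assumed.
        "localStorage": {"theme": "dark"},
    },
}
```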
## Browser & Stealth

Configure how the cloud browser runs. These options apply to all pages in the crawl:

| Field | Type | Default | Description |
|---|---|---|---|
| `stealth` | string | `"auto"` | Stealth mode: `"none"`, `"auto"`, or `"ultra"` |
| `browser.profileId` | string | — | Reuse an existing browser profile |
| `browser.solveCaptchas` | boolean | `false` | Enable CAPTCHA solving |
| `browser.screen` | object | `{ width: 1280, height: 720 }` | Set viewport dimensions (width, height) |
| `browser.location` | object | — | Localize via proxy location (country, state, city). If set, the proxy is enabled automatically |
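A sketch of a `browser` block using these fields. The location values are illustrative; the one behavioral rule shown, that setting a location enables the proxy automatically, comes from the table above:

```python
# Sketch of browser options for a crawl; field names follow the table above.
browser = {
    "solveCaptchas": True,
    "screen": {"width": 1280, "height": 720},
    "location": {"country": "US", "state": "CA", "city": "San Francisco"},
}

def proxy_enabled(browser_options):
    """Per the table above, setting a location enables the proxy automatically."""
    return "location" in browser_options
```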
## Navigation Controls

Control page load behavior and timing for each crawled page:

| Field | Type | Default | Description |
|---|---|---|---|
| `navigation.waitUntil` | string | `"domcontentloaded"` | Load condition: `"load"`, `"domcontentloaded"`, or `"networkidle"` |
| `navigation.waitFor` | number | `0` | Milliseconds to wait after navigation completes before collecting outputs (0–30000) |
| `navigation.timeoutMs` | number | `30000` | Max time (ms) to wait for navigation (1–60000) |
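The documented values and ranges can be checked client-side before sending a request. This helper is a sketch, not part of the SDK; the defaults and bounds mirror the table above:

```python
# Allowed load conditions, per the table above.
ALLOWED_WAIT_UNTIL = {"load", "domcontentloaded", "networkidle"}

def validate_navigation(nav):
    """Fill in documented defaults and reject out-of-range values."""
    wait_until = nav.get("waitUntil", "domcontentloaded")
    wait_for = nav.get("waitFor", 0)
    timeout_ms = nav.get("timeoutMs", 30000)
    if wait_until not in ALLOWED_WAIT_UNTIL:
        raise ValueError(f"invalid waitUntil: {wait_until!r}")
    if not 0 <= wait_for <= 30000:
        raise ValueError("waitFor must be between 0 and 30000 ms")
    if not 1 <= timeout_ms <= 60000:
        raise ValueError("timeoutMs must be between 1 and 60000 ms")
    return {"waitUntil": wait_until, "waitFor": wait_for, "timeoutMs": timeout_ms}
```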
## Cache Controls

Control caching behavior for crawl results:

| Field | Type | Default | Description |
|---|---|---|---|
| `cache.maxAgeSeconds` | number | — | Cached results older than this are treated as stale. Set to `0` to bypass cache reads |
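The freshness rule above can be expressed as a small predicate. This is a sketch of the documented behavior for a set `maxAgeSeconds`, not server code:

```python
def cache_hit(age_seconds, max_age_seconds):
    """True if a cached result of the given age may be reused.

    Per the table above: a value of 0 bypasses cache reads entirely,
    and results older than maxAgeSeconds are treated as stale.
    """
    if max_age_seconds == 0:
        return False
    return age_seconds <= max_age_seconds
```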
## Pagination

Crawl results are returned in batches. You can control pagination when retrieving results:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `page` | number | `0` | Page batch index to retrieve |
| `batchSize` | number | `10` | Number of page results per batch |

Each response also includes pagination metadata:

| Field | Description |
|---|---|
| `totalPages` | Total number of crawled pages |
| `totalPageBatches` | Total number of result batches |
| `currentPageBatch` | Current batch index |
| `batchSize` | Number of results in each batch |
When using `startAndWait` / `start_and_wait` with `returnAllPages` set to `true` (the default), the SDK automatically fetches all paginated results and combines them into a single response.
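The batch-combining behavior can be sketched as follows. `get_batch` stands in for whatever call retrieves one result batch; the function names and the simulated endpoint are illustrative, not the SDK's actual API:

```python
import math

def fetch_all_pages(get_batch, batch_size=10):
    """Fetch every result batch and combine the pages, mirroring what
    the SDK does when returnAllPages is true."""
    first = get_batch(page=0, batch_size=batch_size)
    pages = list(first["data"])
    for i in range(1, first["totalPageBatches"]):
        pages.extend(get_batch(page=i, batch_size=batch_size)["data"])
    return pages

# Simulated paginated endpoint over 23 crawled pages:
ALL = [{"url": f"https://example.com/{i}"} for i in range(23)]

def fake_get_batch(page, batch_size):
    return {
        "totalPages": len(ALL),
        "totalPageBatches": math.ceil(len(ALL) / batch_size),
        "currentPageBatch": page,
        "batchSize": batch_size,
        "data": ALL[page * batch_size:(page + 1) * batch_size],
    }

pages = fetch_all_pages(fake_get_batch, batch_size=10)
```

With 23 crawled pages and a batch size of 10, three batches (10, 10, and 3 results) are fetched and combined into a single list of 23 pages.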