Crawl and Extract URLs
This guide walks you through crawling web pages and extracting structured data from them using Sterndesk’s Crawl Collector.
Prerequisites
Before you begin, ensure you have:
A Sterndesk API key to authenticate your requests.
A project to create your collectors in. For this guide, we’ll assume you have a project ID available; we’ll use proj_xyz789 as an example.
Step 1: Create an Extraction Schema
An extraction schema defines the structure of data you want to extract from web pages. Create a schema that matches the information you’re looking for.
For this example, we’ll create a schema to extract article information from web pages:
curl -X POST https://api.eu.sterndesk.com/r/extraction-schemas \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"project_id": "proj_xyz789",
"name": "Article Extraction",
"json_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\",\"description\":\"The title of the article\"},\"author\":{\"type\":\"string\",\"description\":\"The author of the article\"},\"published_date\":{\"type\":\"string\",\"description\":\"The publication date\"},\"summary\":{\"type\":\"string\",\"description\":\"A brief summary of the article content\"}},\"required\":[\"title\"]}"
}'
The json_schema field must be a JSON-encoded string, not a nested object. See Extraction Schemas for details on schema encoding.
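If you are scripting this step instead of calling curl directly, building the schema as a regular object and serializing it keeps the encoding requirement easy to satisfy. The following is a minimal Python sketch, not an official SDK; it assumes the requests library and reuses the placeholder API key and project ID from this guide.

import json
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eu.sterndesk.com/r"

# Build the extraction schema as a plain dict, then serialize it,
# because json_schema must be a JSON-encoded string rather than a nested object.
article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "The title of the article"},
        "author": {"type": "string", "description": "The author of the article"},
        "published_date": {"type": "string", "description": "The publication date"},
        "summary": {"type": "string", "description": "A brief summary of the article content"},
    },
    "required": ["title"],
}

response = requests.post(
    f"{BASE_URL}/extraction-schemas",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "project_id": "proj_xyz789",
        "name": "Article Extraction",
        "json_schema": json.dumps(article_schema),
    },
)
response.raise_for_status()
schema_id = response.json()["id"]  # e.g. "exsc_abc123"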
Response:
{
"id": "exsc_abc123",
"name": "Article Extraction"
}
Save the schema ID (exsc_abc123) for the next step.
Step 2: Create a Crawl Collector
A Crawl Collector fetches and processes web pages. When you attach an extraction schema to it, crawled pages are automatically extracted upon completion (Direct Extraction mode).
curl -X POST https://api.eu.sterndesk.com/r/crawl-collectors \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"project_id": "proj_xyz789",
"name": "News Articles",
"strategy": "CRAWL_STRATEGY_FIRECRAWL",
"direct_extraction_schema_id": "exsc_abc123"
}'
The strategy field specifies the crawling engine. Currently, CRAWL_STRATEGY_FIRECRAWL is the only supported strategy.
To enable long-term storage of crawled content, add "bundle_enabled": true to your request. This stores the crawled data in S3 as a bundle that can be downloaded later. See Bundled Extraction for details.
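If you prefer to create the collector from code, the sketch below mirrors the curl request above and shows where the optional bundle_enabled flag goes. It is a minimal Python sketch assuming the requests library; the IDs are the placeholders used throughout this guide.

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eu.sterndesk.com/r"

response = requests.post(
    f"{BASE_URL}/crawl-collectors",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "project_id": "proj_xyz789",
        "name": "News Articles",
        "strategy": "CRAWL_STRATEGY_FIRECRAWL",
        "direct_extraction_schema_id": "exsc_abc123",
        # Optional: store crawled content in S3 as a downloadable bundle
        "bundle_enabled": True,
    },
)
response.raise_for_status()
collector_id = response.json()["id"]  # e.g. "crw_coll_def456"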
Response:
{
"id": "crw_coll_def456",
"name": "News Articles"
}
Save the collector ID (crw_coll_def456) for creating crawls.
Step 3: Create a Crawl
To crawl a web page, create a crawl request specifying the URL you want to process:
curl -X POST https://api.eu.sterndesk.com/r/crawls \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"crawl_collector_id": "crw_coll_def456",
"url": "https://example.com/article/12345"
}'
Response:
{
"id": "crw_ghi789",
"status": "CRAWL_STATUS_CREATED",
"url": "https://example.com/article/12345"
}
The crawl starts processing immediately in the background.
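Each crawl targets a single URL, so processing several pages means creating one crawl per URL against the same collector. Below is a minimal Python sketch of that pattern, assuming the requests library; the second URL is illustrative only.

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eu.sterndesk.com/r"
COLLECTOR_ID = "crw_coll_def456"

# Hypothetical list of pages to process with the same collector
urls = [
    "https://example.com/article/12345",
    "https://example.com/article/12346",
]

for url in urls:
    response = requests.post(
        f"{BASE_URL}/crawls",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"crawl_collector_id": COLLECTOR_ID, "url": url},
    )
    response.raise_for_status()
    crawl = response.json()
    print(crawl["id"], crawl["status"])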
Step 4: Poll for Results
Development in Progress: The Crawl implementation is still being developed. While crawls will transition to CRAWL_STATUS_CRAWLED, extraction processing is not yet fully implemented. No extraction results will be returned at this time.
Poll the crawls endpoint to check the status of your crawl:
curl "https://api.eu.sterndesk.com/r/crawls?crawl_collector_id=crw_coll_def456" \
-H "Authorization: Bearer YOUR_API_KEY"
Response:
{
"items": [
{
"id": "crw_ghi789",
"status": "CRAWL_STATUS_CREATED",
"url": "https://example.com/article/12345"
}
]
}
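To wait for results programmatically, poll the same endpoint until no crawl is still in its initial state. The following is a minimal Python polling sketch, assuming the requests library and that finished crawls report CRAWL_STATUS_CRAWLED as noted above.

import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eu.sterndesk.com/r"
COLLECTOR_ID = "crw_coll_def456"

while True:
    response = requests.get(
        f"{BASE_URL}/crawls",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"crawl_collector_id": COLLECTOR_ID},
    )
    response.raise_for_status()
    crawls = response.json()["items"]

    # Keep polling while any crawl is still in its initial state
    if not any(c["status"] == "CRAWL_STATUS_CREATED" for c in crawls):
        break

    time.sleep(5)  # back off between polls

for crawl in crawls:
    print(crawl["id"], crawl["status"], crawl["url"])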
Deleting a Crawl Collector
When you no longer need a crawl collector, you can delete it.
Deleting a crawl collector permanently removes the collector and all associated crawls and extractions. This action cannot be undone.
curl -X DELETE https://api.eu.sterndesk.com/r/crawl-collectors/crw_coll_def456 \
-H "Authorization: Bearer YOUR_API_KEY"
Next Steps