Crawl and Extract URLs

This guide walks you through crawling web pages and extracting structured data from them using Sterndesk’s Crawl Collector.

Prerequisites

Before you begin, ensure you have a Sterndesk API key and a project to work in. For this guide, we'll use proj_xyz789 as an example project ID.

Step 1: Create an Extraction Schema

An extraction schema defines the structure of the data you want to extract from web pages; create one that matches the information you're looking for. For this example, we'll create a schema that extracts article information:
curl -X POST https://api.eu.sterndesk.com/r/extraction-schemas \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "proj_xyz789",
    "name": "Article Extraction",
    "json_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\",\"description\":\"The title of the article\"},\"author\":{\"type\":\"string\",\"description\":\"The author of the article\"},\"published_date\":{\"type\":\"string\",\"description\":\"The publication date\"},\"summary\":{\"type\":\"string\",\"description\":\"A brief summary of the article content\"}},\"required\":[\"title\"]}"
  }'
The json_schema field must be a JSON-encoded string, not a nested object. See Extraction Schemas for details on schema encoding.
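For reference, the encoded string above decodes to this schema:
{
  "type": "object",
  "properties": {
    "title": { "type": "string", "description": "The title of the article" },
    "author": { "type": "string", "description": "The author of the article" },
    "published_date": { "type": "string", "description": "The publication date" },
    "summary": { "type": "string", "description": "A brief summary of the article content" }
  },
  "required": ["title"]
}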
Response:
{
  "id": "exsc_abc123",
  "name": "Article Extraction"
}
Save the schema ID (exsc_abc123) for the next step.
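If you script these calls, you can capture the ID directly from the response. A minimal sketch, assuming jq is installed and the request body from above is saved in a (hypothetical) file named schema-request.json:
# Create the schema and extract its ID from the JSON response
SCHEMA_ID=$(curl -s -X POST https://api.eu.sterndesk.com/r/extraction-schemas \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @schema-request.json \
  | jq -r '.id')
echo "$SCHEMA_ID"  # exsc_abc123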

Step 2: Create a Crawl Collector

A Crawl Collector fetches and processes web pages. When you attach an extraction schema to it, crawled pages are automatically extracted upon completion (Direct Extraction mode).
curl -X POST https://api.eu.sterndesk.com/r/crawl-collectors \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "proj_xyz789",
    "name": "News Articles",
    "strategy": "CRAWL_STRATEGY_FIRECRAWL",
    "direct_extraction_schema_id": "exsc_abc123"
  }'
The strategy field specifies the crawling engine. Currently, CRAWL_STRATEGY_FIRECRAWL is the only supported strategy.
To enable long-term storage of crawled content, add "bundle_enabled": true to your request. This stores the crawled data in S3 as a bundle that can be downloaded later. See Bundled Extraction for details.
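With bundling enabled, the request body would look like this:
{
  "project_id": "proj_xyz789",
  "name": "News Articles",
  "strategy": "CRAWL_STRATEGY_FIRECRAWL",
  "direct_extraction_schema_id": "exsc_abc123",
  "bundle_enabled": true
}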
Response:
{
  "id": "crw_coll_def456",
  "name": "News Articles"
}
Save the collector ID (crw_coll_def456) for creating crawls.

Step 3: Create a Crawl

To crawl a web page, create a crawl request specifying the URL you want to process:
curl -X POST https://api.eu.sterndesk.com/r/crawls \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "crawl_collector_id": "crw_coll_def456",
    "url": "https://example.com/article/12345"
  }'
Response:
{
  "id": "crw_ghi789"
}
The crawl starts processing immediately in the background.
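To crawl several pages, you can loop over URLs against the same endpoint. A minimal sketch; the second URL is a made-up example:
# Create one crawl per URL in the list
for URL in https://example.com/article/12345 https://example.com/article/67890; do
  curl -s -X POST https://api.eu.sterndesk.com/r/crawls \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"crawl_collector_id\": \"crw_coll_def456\", \"url\": \"$URL\"}"
done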

Step 4: Poll for Results

Development in Progress: The Crawl implementation is still being developed. While crawls will transition to CRAWL_STATUS_CRAWLED, extraction processing is not yet fully implemented. No extraction results will be returned at this time.
Poll the crawls endpoint to check the status of your crawl:
curl "https://api.eu.sterndesk.com/r/crawls?crawl_collector_id=crw_coll_def456" \
  -H "Authorization: Bearer YOUR_API_KEY"
Response:
{
  "items": [
    {
      "id": "crw_ghi789",
      "status": "CRAWL_STATUS_CREATED",
      "url": "https://example.com/article/12345"
    }
  ]
}
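To wait for a crawl to finish programmatically, you can poll in a loop until its status changes. A minimal sketch, assuming jq is installed; the 5-second interval is an arbitrary choice:
# Poll until the crawl reaches CRAWL_STATUS_CRAWLED
while true; do
  STATUS=$(curl -s "https://api.eu.sterndesk.com/r/crawls?crawl_collector_id=crw_coll_def456" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    | jq -r '.items[] | select(.id == "crw_ghi789") | .status')
  echo "Crawl status: $STATUS"
  [ "$STATUS" = "CRAWL_STATUS_CRAWLED" ] && break
  sleep 5
done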

Deleting a Crawl Collector

When you no longer need a crawl collector, you can delete it.
Deleting a crawl collector permanently removes the collector and all associated crawls and extractions. This action cannot be undone.
curl -X DELETE https://api.eu.sterndesk.com/r/crawl-collectors/crw_coll_def456 \
  -H "Authorization: Bearer YOUR_API_KEY"

Next Steps