Crawl and Extract URLs
This guide walks you through crawling web pages and extracting structured data from them using Sterndesk’s Crawl Collector.
Prerequisites
Before you begin, ensure you have:
A Sterndesk API key to authenticate your requests.
A project to create your collectors in. For this guide, we’ll assume you have a project ID available; we’ll use proj_xyz789 as an example.
Step 1: Create an Extraction Schema
An extraction schema defines the structure of data you want to extract from web pages. Create a schema that matches the information you’re looking for.
For this example, we’ll create a schema to extract article information from web pages:
curl -X POST https://api.eu.sterndesk.com/r/extraction-schemas \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"project_id": "proj_xyz789",
"name": "Article Extraction",
"json_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\",\"description\":\"The title of the article\"},\"author\":{\"type\":\"string\",\"description\":\"The author of the article\"},\"published_date\":{\"type\":\"string\",\"description\":\"The publication date\"},\"summary\":{\"type\":\"string\",\"description\":\"A brief summary of the article content\"}},\"required\":[\"title\"]}"
}'
The json_schema field must be a JSON-encoded string, not a nested object. See Extraction Schemas for details on schema encoding.
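If you are scripting this step instead of calling curl directly, building the schema as a regular object and serializing it keeps the encoding requirement easy to satisfy. The following is a minimal Python sketch, not an official SDK; it assumes the requests library and reuses the placeholder API key and project ID from this guide.

import json
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eu.sterndesk.com/r"

# Build the extraction schema as a plain dict, then serialize it,
# because json_schema must be a JSON-encoded string rather than a nested object.
article_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "The title of the article"},
        "author": {"type": "string", "description": "The author of the article"},
        "published_date": {"type": "string", "description": "The publication date"},
        "summary": {"type": "string", "description": "A brief summary of the article content"},
    },
    "required": ["title"],
}

response = requests.post(
    f"{BASE_URL}/extraction-schemas",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "project_id": "proj_xyz789",
        "name": "Article Extraction",
        "json_schema": json.dumps(article_schema),
    },
)
response.raise_for_status()
schema_id = response.json()["id"]  # e.g. "exsc_abc123"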
Response:
{
"id": "exsc_abc123",
"name": "Article Extraction"
}
Save the schema ID (exsc_abc123) for the next step.
Step 2: Create a Crawl Collector
A Crawl Collector fetches and processes web pages. When you attach an extraction schema to it, crawled pages are automatically extracted upon completion (Direct Extraction mode).
curl -X POST https://api.eu.sterndesk.com/r/crawl-collectors \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"project_id": "proj_xyz789",
"name": "News Articles",
"strategy": "CRAWL_STRATEGY_FIRECRAWL",
"direct_extraction_schema_id": "exsc_abc123"
}'
The strategy field specifies the crawling engine. Currently, CRAWL_STRATEGY_FIRECRAWL is the only supported strategy.
To enable long-term storage of crawled content, add "bundle_enabled": true to your request. This stores the crawled data in S3 as a bundle that can be downloaded later. See Bundled Extraction for details.
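If you prefer to create the collector from code, the sketch below mirrors the curl request above and shows where the optional bundle_enabled flag goes. It is a minimal Python sketch assuming the requests library; the IDs are the placeholders used throughout this guide.

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eu.sterndesk.com/r"

response = requests.post(
    f"{BASE_URL}/crawl-collectors",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "project_id": "proj_xyz789",
        "name": "News Articles",
        "strategy": "CRAWL_STRATEGY_FIRECRAWL",
        "direct_extraction_schema_id": "exsc_abc123",
        # Optional: store crawled content in S3 as a downloadable bundle
        "bundle_enabled": True,
    },
)
response.raise_for_status()
collector_id = response.json()["id"]  # e.g. "crw_coll_def456"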
Response:
{
"id": "crw_coll_def456",
"name": "News Articles"
}
Save the collector ID (crw_coll_def456) for creating crawls.
Step 3: Create a Crawl
To crawl a web page, create a crawl request specifying the URL you want to process:
curl -X POST https://api.eu.sterndesk.com/r/crawls \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"crawl_collector_id": "crw_coll_def456",
"url": "https://example.com/article/12345"
}'
Response:
{
"id": "crw_ghi789",
"status": "CRAWL_STATUS_CREATED",
"url": "https://example.com/article/12345"
}
The crawl starts processing immediately in the background.
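Each crawl targets a single URL, so processing several pages means creating one crawl per URL against the same collector. Below is a minimal Python sketch of that pattern, assuming the requests library; the second URL is illustrative only.

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eu.sterndesk.com/r"
COLLECTOR_ID = "crw_coll_def456"

# Hypothetical list of pages to process with the same collector
urls = [
    "https://example.com/article/12345",
    "https://example.com/article/12346",
]

for url in urls:
    response = requests.post(
        f"{BASE_URL}/crawls",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"crawl_collector_id": COLLECTOR_ID, "url": url},
    )
    response.raise_for_status()
    crawl = response.json()
    print(crawl["id"], crawl["status"])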
Step 4: Poll for Results
Development in Progress: The Crawl implementation is still being developed. While crawls will transition to CRAWL_STATUS_CRAWLED, extraction processing is not yet fully implemented. No extraction results will be returned at this time.
Poll the crawls endpoint to check the status of your crawl:
curl "https://api.eu.sterndesk.com/r/crawls?crawl_collector_id=crw_coll_def456" \
-H "Authorization: Bearer YOUR_API_KEY"
Response:
{
"items": [
{
"id": "crw_ghi789",
"status": "CRAWL_STATUS_CREATED",
"url": "https://example.com/article/12345"
}
]
}
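To wait for results programmatically, poll the same endpoint until no crawl is still in its initial state. The following is a minimal Python polling sketch, assuming the requests library and that finished crawls report CRAWL_STATUS_CRAWLED as noted above.

import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eu.sterndesk.com/r"
COLLECTOR_ID = "crw_coll_def456"

while True:
    response = requests.get(
        f"{BASE_URL}/crawls",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"crawl_collector_id": COLLECTOR_ID},
    )
    response.raise_for_status()
    crawls = response.json()["items"]

    # Keep polling while any crawl is still in its initial state
    if not any(c["status"] == "CRAWL_STATUS_CREATED" for c in crawls):
        break

    time.sleep(5)  # back off between polls

for crawl in crawls:
    print(crawl["id"], crawl["status"], crawl["url"])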
Deleting a Crawl Collector
When you no longer need a crawl collector, you can delete it.
Deleting a crawl collector permanently removes the collector and all associated crawls and extractions. This action cannot be undone.
curl -X DELETE https://api.eu.sterndesk.com/r/crawl-collectors/crw_coll_def456 \
-H "Authorization: Bearer YOUR_API_KEY"
Next Steps