An extraction schema defines the structure of the data you want to extract from web pages. Create a schema that matches the information you're looking for. For this example, we'll create a schema that extracts article information from web pages:
curl -X POST https://api.eu.sterndesk.com/r/extraction-schemas \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "proj_xyz789",
    "name": "Article Extraction",
    "json_schema": "{\"type\":\"object\",\"properties\":{\"title\":{\"type\":\"string\",\"description\":\"The title of the article\"},\"author\":{\"type\":\"string\",\"description\":\"The author of the article\"},\"published_date\":{\"type\":\"string\",\"description\":\"The publication date\"},\"summary\":{\"type\":\"string\",\"description\":\"A brief summary of the article content\"}},\"required\":[\"title\"]}"
  }'
The json_schema field must be a JSON-encoded string, not a nested object. See Extraction Schemas for details on schema encoding.
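One way to produce that encoding without hand-escaping quotes is to let `jq` embed the schema as a string (a sketch; the `schema.json` file name is arbitrary and not part of the API):

```shell
# Build the request body with json_schema as a JSON-encoded *string*,
# not a nested object. The schema file and jq usage are illustrative.
cat > schema.json <<'EOF'
{"type":"object","properties":{"title":{"type":"string"}},"required":["title"]}
EOF

# jq -c compacts the schema; --arg passes it through as an escaped string.
BODY=$(jq -n --arg s "$(jq -c . schema.json)" \
  '{project_id: "proj_xyz789", name: "Article Extraction", json_schema: $s}')
echo "$BODY"
```

The resulting `BODY` can be passed directly to `curl -d`.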
A Crawl Collector fetches and processes web pages. When you attach an extraction schema to it, crawled pages are extracted automatically once the crawl completes (Direct Extraction mode).
The strategy field specifies the crawling engine. Currently, CRAWL_STRATEGY_FIRECRAWL is the only supported strategy.
To enable long-term storage of crawled content, add "bundle_enabled": true to your request. This stores the crawled data in S3 as a bundle that can be downloaded later. See Bundled Extraction for details.
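Putting these fields together, a collector-creation request might look like the following. This is a sketch: the `/r/crawl-collectors` path and the `extraction_schema_id` and `start_url` field names are assumptions, since the exact collector endpoint is not shown above.

```shell
# Hypothetical request; endpoint path and some field names are assumptions.
curl -X POST https://api.eu.sterndesk.com/r/crawl-collectors \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "project_id": "proj_xyz789",
    "extraction_schema_id": "schema_abc123",
    "strategy": "CRAWL_STRATEGY_FIRECRAWL",
    "bundle_enabled": true,
    "start_url": "https://example.com/blog"
  }'
```

Attaching `extraction_schema_id` enables Direct Extraction; `bundle_enabled` additionally stores the crawled data in S3 as a downloadable bundle.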
Development in Progress: The Crawl implementation is still being developed. While crawls will transition to CRAWL_STATUS_CRAWLED, extraction processing is not yet fully implemented. No extraction results will be returned at this time.
Poll the crawls endpoint to check the status of your crawl:
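A polling call might look like this (a sketch: the `/r/crawls/{id}` path and the crawl ID are assumptions based on the endpoint naming above):

```shell
# Hypothetical status check; repeat until the crawl reaches
# CRAWL_STATUS_CRAWLED. Path and ID are illustrative.
curl -s https://api.eu.sterndesk.com/r/crawls/crawl_abc123 \
  -H "Authorization: Bearer YOUR_API_KEY" | jq '.status'
```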