Extraction Schemas
Extraction schemas define the structure and format of data you want to extract from documents. They tell Sterndesk exactly what information to look for and how to organize it in the output. Schemas are typically designed for a specific type of document: a schema for extracting data from vessel reports will look very different from one for research papers or product pages.
What Is an Extraction Schema?
An extraction schema is a blueprint that describes the shape of the data you want to extract. When you process a document, Sterndesk uses the schema to:
- Guide extraction: the AI uses the schema to determine which fields to look for
- Validate output: the extracted data is checked for conformance to your expected structure
- Standardize results: all extractions for a given schema produce consistent, predictable JSON
JSON Schema Standard
Sterndesk uses JSON Schema as the format for defining extraction schemas. JSON Schema is an open standard for describing and validating JSON data structures. Key features of JSON Schema that make it well suited to extraction:
| Feature | Description |
|---|---|
| Type definitions | Specify whether a field is a string, number, boolean, array, or object |
| Required fields | Mark which fields must be present in the output |
| Enumerations | Restrict values to a predefined set of options |
| Nested structures | Define complex hierarchies with objects containing other objects |
| Descriptions | Add human-readable descriptions that help the AI understand each field |
Field descriptions are particularly important—they provide context that helps the AI accurately identify and extract the right information from documents.
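To make the features in the table concrete, here is a minimal sketch of a schema that uses each of them. The vessel-report field names and enum values are assumptions chosen for illustration, not fields Sterndesk defines, and the two helper functions are hypothetical stand-ins for a full JSON Schema validator (for example the third-party `jsonschema` package), covering only top-level required fields and enumerations.

```python
# Hypothetical vessel-report schema; field names and enum values are illustrative.
vessel_report_schema = {
    "type": "object",
    "properties": {
        "vessel_name": {
            "type": "string",
            "description": "Name of the vessel as written in the report header",
        },
        "vessel_type": {
            "type": "string",
            "enum": ["tanker", "bulk_carrier", "container_ship"],
            "description": "Category of the vessel",
        },
        "crew_count": {
            "type": "integer",
            "description": "Number of crew members on board",
        },
        # Nested structure: an object containing its own typed fields.
        "position": {
            "type": "object",
            "description": "Reported position of the vessel",
            "properties": {
                "latitude": {"type": "number", "description": "Decimal degrees"},
                "longitude": {"type": "number", "description": "Decimal degrees"},
            },
            "required": ["latitude", "longitude"],
        },
    },
    "required": ["vessel_name", "vessel_type"],
}


def missing_required(schema, data):
    """Return the top-level required field names that are absent from data."""
    return [name for name in schema.get("required", []) if name not in data]


def enum_violations(schema, data):
    """Return top-level fields whose value falls outside the schema's enum."""
    bad = []
    for name, spec in schema["properties"].items():
        if name in data and "enum" in spec and data[name] not in spec["enum"]:
            bad.append(name)
    return bad


sample = {"vessel_name": "MV Example", "vessel_type": "submarine"}
print(missing_required(vessel_report_schema, sample))  # []
print(enum_violations(vessel_report_schema, sample))   # ['vessel_type']
```

Note that every property carries a `description`, since that text is what gives the AI context about where the value is likely to appear in the document.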
Schemas and Collectors
Extraction schemas are linked to collectors. When you configure a collector, you associate it with an extraction schema. Every document processed by that collector will have its data extracted according to the schema. This lets you:
- Use the same schema across multiple collectors
- Process different document types with different schemas
- Update a schema once and have it apply to all associated collectors
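The relationship between schemas and collectors can be pictured as collectors holding a reference to a schema rather than a copy of it. The sketch below is an in-memory illustration only: the dictionaries, the collector names, the `schema_id` key, and the `schema_for` helper are all hypothetical and are not Sterndesk API objects.

```python
# Hypothetical in-memory model; none of these names come from the Sterndesk API.
schemas = {
    "vessel-report-v1": {
        "type": "object",
        "properties": {"vessel_name": {"type": "string"}},
        "required": ["vessel_name"],
    }
}

# Two collectors reference the same schema by ID instead of copying it.
collectors = {
    "email-inbox": {"schema_id": "vessel-report-v1"},
    "sftp-upload": {"schema_id": "vessel-report-v1"},
}


def schema_for(collector_name):
    """Look up the schema currently linked to a collector."""
    return schemas[collectors[collector_name]["schema_id"]]


# Updating the schema once changes what every linked collector extracts.
schemas["vessel-report-v1"]["properties"]["imo_number"] = {"type": "string"}

print(sorted(schema_for("email-inbox")["properties"]))        # ['imo_number', 'vessel_name']
print(schema_for("email-inbox") is schema_for("sftp-upload"))  # True
```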
Schema Encoding
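As a minimal sketch of the encoding rule this section covers, the snippet below contrasts a schema embedded as a nested object with the same schema serialized to a JSON string, using Python's standard `json` module. The `extraction_schema` field name and the shape of the request body are assumptions for illustration; the actual request format is defined by the Sterndesk API reference.

```python
import json

# Illustrative schema; any JSON Schema object is handled the same way.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Document title"},
    },
    "required": ["title"],
}

# Incorrect: the schema is embedded as a nested object.
# "extraction_schema" is a hypothetical field name.
incorrect_body = {"extraction_schema": schema}

# Correct: the schema is serialized (stringified) first.
correct_body = {"extraction_schema": json.dumps(schema)}

# Serializing the request body then escapes the schema string inside
# the outer JSON document.
payload = json.dumps(correct_body)

decoded = json.loads(payload)
print(type(decoded["extraction_schema"]).__name__)         # str
print(json.loads(decoded["extraction_schema"]) == schema)  # True
```

The schema is therefore serialized twice in total: once on its own, and once more as part of the request body.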
When sending an extraction schema to the Sterndesk API, the JSON Schema must be encoded as a JSON string. This means the schema object must be serialized (stringified) before it is included in your API request. Passing the schema as a nested JSON object is incorrect.
Summarization as Extraction
Summarization is a special form of extraction in which, instead of pulling out discrete fields, you ask the AI to generate a condensed representation of the document’s content. You can implement summarization by defining schema fields that expect synthesized content rather than verbatim extraction.
When to Use Summarization Fields
- Long documents — Condense lengthy reports into digestible summaries
- Research papers — Extract methodology summaries and key conclusions
- Meeting notes — Generate action item lists from unstructured notes
- Compliance documents — Summarize findings and recommendations
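A summarization schema for use cases like these might mix one verbatim field with several synthesized ones. The following is a sketch under assumptions: the field names (`executive_summary`, `key_findings`, `action_items`) and the description wording are hypothetical, chosen to show how a description can ask for condensed content rather than a copied value.

```python
import json

# Hypothetical schema mixing a verbatim field with synthesized fields.
report_summary_schema = {
    "type": "object",
    "properties": {
        "title": {
            "type": "string",
            "description": "Title of the document, copied verbatim",
        },
        "executive_summary": {
            "type": "string",
            "description": "Two to three sentence summary of the document's main points",
        },
        "key_findings": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Short, self-contained statements of the most important findings",
        },
        "action_items": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Follow-up tasks stated or clearly implied in the document",
        },
    },
    "required": ["title", "executive_summary"],
}

# Fields whose descriptions ask for synthesized rather than verbatim content.
synthesized = [name for name in report_summary_schema["properties"] if name != "title"]
print(synthesized)  # ['executive_summary', 'key_findings', 'action_items']

# The schema is still plain JSON Schema, so it serializes like any other.
encoded = json.dumps(report_summary_schema)
```

The schema structure is the same as for ordinary extraction; only the descriptions change what the AI is asked to produce.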
Schema Examples
Simple Schema
A basic schema might extract contact information from a document.
Complex Schema
A more comprehensive schema, for extracting maritime work experience from a resume, combines nested objects, arrays, and enumerations. Such a schema demonstrates:
- Nested objects: `personal_info` groups related fields
- Arrays of objects: `sea_experience` captures multiple work entries
- Enumerations: `vessel_type` and `rank` restrict values to valid options
- Optional fields: not all fields are required, allowing partial extraction
- Descriptive fields: each property includes a description to guide extraction
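The complex schema described above could look something like the following sketch. Only the names `personal_info`, `sea_experience`, `vessel_type`, and `rank` come from this page; every other field name, the enum values, and the choice of required fields are assumptions made for illustration.

```python
import json

# Sketch of a maritime resume schema; enum values and most fields are assumed.
maritime_resume_schema = {
    "type": "object",
    "properties": {
        # Nested object grouping related fields.
        "personal_info": {
            "type": "object",
            "description": "Identifying details of the seafarer",
            "properties": {
                "full_name": {"type": "string", "description": "Full name as written on the resume"},
                "nationality": {"type": "string", "description": "Stated nationality, if any"},
            },
            "required": ["full_name"],
        },
        # Array of objects: one entry per period of sea service.
        "sea_experience": {
            "type": "array",
            "description": "One entry per vessel assignment listed on the resume",
            "items": {
                "type": "object",
                "properties": {
                    "vessel_name": {"type": "string", "description": "Name of the vessel served on"},
                    "vessel_type": {
                        "type": "string",
                        "enum": ["tanker", "bulk_carrier", "container_ship", "other"],
                        "description": "Category of the vessel",
                    },
                    "rank": {
                        "type": "string",
                        "enum": ["master", "chief_officer", "chief_engineer", "able_seaman", "other"],
                        "description": "Rank held during the assignment",
                    },
                    "sign_on_date": {"type": "string", "description": "Date joined, as YYYY-MM-DD if available"},
                },
                # vessel_name and sign_on_date stay optional to allow partial extraction.
                "required": ["vessel_type", "rank"],
            },
        },
    },
    "required": ["personal_info"],
}

entry = maritime_resume_schema["properties"]["sea_experience"]["items"]
optional = sorted(set(entry["properties"]) - set(entry["required"]))
print(optional)  # ['sign_on_date', 'vessel_name']

# Stringify before sending, as described under Schema Encoding.
encoded = json.dumps(maritime_resume_schema)
```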