Extraction Schemas

Extraction schemas define the structure and format of data you want to extract from documents. They tell Sterndesk exactly what information to look for and how to organize it in the output. Schemas are typically designed for a specific type of document—a schema for extracting data from vessel reports will look very different from one for research papers or product pages.

What Is an Extraction Schema?

An extraction schema is a blueprint that describes the shape of the data you want to extract. When you process a document, Sterndesk uses the schema to:

Guide extraction — The AI understands what fields to look for based on the schema
Validate output — Ensures the extracted data conforms to your expected structure
Standardize results — All extractions for a given schema produce consistent, predictable JSON

Without a schema, extraction results would be unstructured and inconsistent. Schemas give you control over the exact output format, making it easy to integrate extracted data into your applications and databases.

JSON Schema Standard

Sterndesk uses JSON Schema as the format for defining extraction schemas. JSON Schema is an open standard for describing and validating JSON data structures. The following features make it well suited to extraction:

Feature	Description
Type definitions	Specify whether a field is a string, number, boolean, array, or object
Required fields	Mark which fields must be present in the output
Enumerations	Restrict values to a predefined set of options
Nested structures	Define complex hierarchies with objects containing other objects
Descriptions	Add human-readable descriptions that help the AI understand each field

Field descriptions are particularly important—they provide context that helps the AI accurately identify and extract the right information from documents.
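Because descriptions carry so much weight, it can help to check a schema for fields that lack one before uploading it. The following is a minimal sketch in Python, not part of Sterndesk; the function name `find_undescribed` and the sample `addresses` field are hypothetical and used only for illustration.

```python
def find_undescribed(schema, path=""):
    """Return the paths of properties that have no non-empty description."""
    missing = []
    for name, prop in schema.get("properties", {}).items():
        prop_path = f"{path}.{name}" if path else name
        if not prop.get("description"):
            missing.append(prop_path)
        # Recurse into nested objects and into the item schema of arrays.
        if prop.get("type") == "object":
            missing.extend(find_undescribed(prop, prop_path))
        elif prop.get("type") == "array" and isinstance(prop.get("items"), dict):
            missing.extend(find_undescribed(prop["items"], prop_path + "[]"))
    return missing


# Hypothetical schema used only to exercise the check.
schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string", "description": "The full name of the person"},
        "email": {"type": "string"},
        "addresses": {
            "type": "array",
            "description": "Postal addresses",
            "items": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    },
}

print(find_undescribed(schema))  # ['email', 'addresses[].city']
```

A check like this only reports that a description is absent; it cannot judge whether an existing description is specific enough to guide extraction.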

Schemas and Collectors

Extraction schemas are linked to collectors. When you configure a collector, you associate it with an extraction schema. Every document processed by that collector will have its data extracted according to the schema.

Collector → uses → Extraction Schema → produces → Extractions

This design allows you to:

Use the same schema across multiple collectors
Process different document types with different schemas
Update a schema once and have it apply to all associated collectors
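The relationship above can be pictured as collectors holding a reference to a schema rather than a copy of it. The sketch below is a conceptual model in Python only; the class names, the `schema_id` field, and the collector names are assumptions made for illustration and do not describe Sterndesk's actual data model or API.

```python
from dataclasses import dataclass


@dataclass
class ExtractionSchema:
    schema_id: str      # hypothetical identifier
    json_schema: dict


@dataclass
class Collector:
    name: str
    schema_id: str      # a reference to a schema, not a copy of it


# One schema shared by two collectors (all names are illustrative).
schemas = {
    "contact": ExtractionSchema(
        "contact",
        {"type": "object", "properties": {"full_name": {"type": "string"}}},
    )
}
collectors = [Collector("inbox-a", "contact"), Collector("inbox-b", "contact")]


def schema_for(collector):
    """Look up the schema a collector points at."""
    return schemas[collector.schema_id].json_schema


# Updating the schema once is visible through every collector that references it.
schemas["contact"].json_schema["properties"]["email"] = {"type": "string"}
for collector in collectors:
    print(collector.name, sorted(schema_for(collector)["properties"]))
```

Because each collector stores only the reference, there is a single place to change when the expected output changes.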

Schema Encoding

When sending an extraction schema to the Sterndesk API, the JSON Schema must be encoded as a JSON string. This means the schema object needs to be serialized (stringified) before being included in your API request.

A common mistake is sending the schema as a nested JSON object. The API expects the schema as a string value, not an embedded object.

Incorrect — schema as nested object:

{
  "name": "contact-extraction",
  "json_schema": {
    "type": "object",
    "properties": { ... }
  }
}

Correct — schema as JSON string:

{
  "name": "contact-extraction",
  "json_schema": "{\"type\":\"object\",\"properties\":{...}}"
}
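In Python, the encoding step amounts to serializing the schema once on its own and then serializing the request body that contains it. The sketch below uses only the standard `json` module and the `name` and `json_schema` fields shown above; it builds the request body but does not send it, since the endpoint and authentication are outside the scope of this page.

```python
import json

# The schema as an ordinary Python dict.
schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string", "description": "The full name of the person"},
    },
    "required": ["full_name"],
}

# First pass: stringify the schema so it becomes a JSON string value.
payload = {
    "name": "contact-extraction",
    "json_schema": json.dumps(schema),
}

# Second pass: serialize the whole request body as usual.
body = json.dumps(payload)
print(body)

# Round trip: the field decodes to a string, which decodes to the schema.
decoded = json.loads(body)
assert isinstance(decoded["json_schema"], str)
assert json.loads(decoded["json_schema"]) == schema
```

If your HTTP client serializes the body for you, pass it `payload` rather than `body`; the schema still needs its own `json.dumps` call either way.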

Summarization as Extraction

Summarization is a special form of extraction where instead of pulling out discrete fields, you ask the AI to generate a condensed representation of the document’s content. You can implement summarization by defining schema fields that expect synthesized content rather than verbatim extraction:

{
  "type": "object",
  "properties": {
    "executive_summary": {
      "type": "string",
      "description": "A 2-3 sentence summary of the document's main points and conclusions"
    },
    "key_findings": {
      "type": "array",
      "items": { "type": "string" },
      "description": "List of the 3-5 most important findings or takeaways"
    },
    "action_items": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Any recommended actions or next steps mentioned in the document"
    }
  }
}

This approach lets you combine summarization with structured extraction in a single schema—extract specific data points alongside generated summaries.
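One way to assemble such a combined schema is to keep the summary fields and the data fields in separate groups and merge them. The following Python sketch assumes nothing about Sterndesk beyond the schema format; the `combine` helper and the `title` and `publication_year` fields are hypothetical examples.

```python
import json

# Fields that ask for synthesized content (taken from the example above).
summary_properties = {
    "executive_summary": {
        "type": "string",
        "description": "A 2-3 sentence summary of the document's main points and conclusions",
    },
    "key_findings": {
        "type": "array",
        "items": {"type": "string"},
        "description": "List of the 3-5 most important findings or takeaways",
    },
}

# Hypothetical fields that ask for specific data points.
data_properties = {
    "title": {"type": "string", "description": "Title of the document"},
    "publication_year": {"type": "integer", "description": "Four-digit year of publication"},
}


def combine(*property_sets):
    """Merge property groups into one object schema, rejecting duplicate names."""
    combined = {}
    for props in property_sets:
        overlap = combined.keys() & props.keys()
        if overlap:
            raise ValueError(f"duplicate field names: {sorted(overlap)}")
        combined.update(props)
    return {"type": "object", "properties": combined}


schema = combine(data_properties, summary_properties)
print(json.dumps(schema, indent=2))
```

Keeping the groups separate makes it easier to reuse the same summary fields across schemas for different document types.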

When to Use Summarization Fields

Long documents — Condense lengthy reports into digestible summaries
Research papers — Extract methodology summaries and key conclusions
Meeting notes — Generate action item lists from unstructured notes
Compliance documents — Summarize findings and recommendations

Schema Examples

Simple Schema

A basic schema for extracting contact information from a document:

{
  "type": "object",
  "properties": {
    "full_name": {
      "type": "string",
      "description": "The full name of the person"
    },
    "email": {
      "type": "string",
      "description": "Email address"
    },
    "phone": {
      "type": "string",
      "description": "Phone number in any format"
    },
    "company": {
      "type": "string",
      "description": "Company or organization name"
    }
  },
  "required": ["full_name"]
}
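To illustrate what `required` and `type` mean for the output, the sketch below checks a flat extraction result against this schema in Python. It is a deliberately small illustration, not Sterndesk's validation logic: it handles only top-level fields, and the sample results are invented. A complete validator, such as the third-party `jsonschema` package, covers far more of the standard.

```python
# Map JSON Schema type names to Python types.
TYPE_MAP = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def matches(value, type_name):
    """Return True if value has the given JSON Schema type."""
    expected = TYPE_MAP.get(type_name)
    if expected is None:
        return True  # unknown or missing type: not checked in this sketch
    if isinstance(value, bool) and type_name != "boolean":
        return False  # bool is a subclass of int in Python
    return isinstance(value, expected)


def check_flat(schema, data):
    """Collect errors for missing required fields and wrongly typed values."""
    errors = []
    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"missing required field: {name}")
    for name, value in data.items():
        prop = schema.get("properties", {}).get(name)
        if prop is not None and not matches(value, prop.get("type")):
            errors.append(f"{name}: expected {prop['type']}, got {type(value).__name__}")
    return errors


schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string"},
        "email": {"type": "string"},
    },
    "required": ["full_name"],
}

# Invented sample results, used only to exercise the check.
print(check_flat(schema, {"full_name": "Ada Lovelace", "email": "ada@example.com"}))
print(check_flat(schema, {"email": 42}))
```

The first call reports no errors because only `full_name` is required; the second reports both the missing name and the wrongly typed email.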

Complex Schema

A more comprehensive schema for extracting maritime work experience from a resume, with nested objects, arrays, and enumerations:

{
  "type": "object",
  "properties": {
    "personal_info": {
      "type": "object",
      "properties": {
        "first_name": {
          "type": "string",
          "description": "First name of the candidate"
        },
        "last_name": {
          "type": "string",
          "description": "Last name of the candidate"
        },
        "nationality": {
          "type": "string",
          "description": "ISO 3166-1 alpha-2 country code (e.g., 'US', 'NL', 'PH')"
        },
        "email_addresses": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Contact email addresses"
        }
      }
    },
    "sea_experience": {
      "type": "array",
      "description": "Work experience on vessels or offshore",
      "items": {
        "type": "object",
        "properties": {
          "vessel_name": {
            "type": "string",
            "description": "Name of the vessel"
          },
          "vessel_type": {
            "type": "string",
            "enum": ["bulk_carrier", "tanker", "container_ship", "offshore_supply", "dredger", "tug", "other"],
            "description": "Type of vessel"
          },
          "rank": {
            "type": "string",
            "enum": ["master", "chief_officer", "second_officer", "third_officer", "chief_engineer", "second_engineer", "bosun", "ab", "os"],
            "description": "Position held on the vessel"
          },
          "start_date": {
            "type": "object",
            "properties": {
              "year": { "type": "integer", "description": "Year (e.g., 2023)" },
              "month": { "type": "integer", "description": "Month (1-12)" }
            }
          },
          "end_date": {
            "type": "object",
            "properties": {
              "year": { "type": "integer", "description": "Year (e.g., 2024)" },
              "month": { "type": "integer", "description": "Month (1-12)" }
            }
          },
          "to_present": {
            "type": "boolean",
            "description": "True if this is the current position"
          },
          "responsibilities_summary": {
            "type": "string",
            "description": "Brief summary of duties and responsibilities"
          }
        }
      }
    },
    "certifications": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Certificate name" },
          "issuing_authority": { "type": "string", "description": "Organization that issued the certificate" },
          "expiry_date": { "type": "string", "description": "Expiration date if applicable" }
        }
      }
    }
  }
}

This schema demonstrates:

Nested objects — personal_info groups related fields
Arrays of objects — sea_experience captures multiple work entries
Enumerations — vessel_type and rank restrict values to valid options
Optional fields — Not all fields are required, allowing partial extraction
Descriptive fields — Each property includes a description to guide extraction
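The `start_date` and `end_date` objects in this schema share the same shape. If you build schemas in code, a small helper can generate such repeated sub-schemas. The Python sketch below is one possible approach; the helper name `year_month` is hypothetical, and the object-level description it adds is not present in the schema above.

```python
import json


def year_month(label, example_year):
    """Build a year/month object schema like start_date and end_date."""
    return {
        "type": "object",
        "description": label,  # added here; the schema above omits it
        "properties": {
            "year": {"type": "integer", "description": f"Year (e.g., {example_year})"},
            "month": {"type": "integer", "description": "Month (1-12)"},
        },
    }


# A fragment of the sea_experience item schema, built with the helper.
experience_item = {
    "type": "object",
    "properties": {
        "vessel_name": {"type": "string", "description": "Name of the vessel"},
        "start_date": year_month("Month in which the position started", 2023),
        "end_date": year_month("Month in which the position ended", 2024),
    },
}

print(json.dumps(experience_item, indent=2))
```

Generating the repeated parts keeps the two date objects from drifting apart when one of them is edited later.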

API Reference

For detailed information on creating and managing extraction schemas, see the API Reference.

Next Steps

Collectors

Extractions
