Heratio Help Center article. Category: Technical / Integration.

schema.org Dataset descriptor (Google Dataset Search)

Overview

The platform publishes a single schema.org/Dataset descriptor so that general web search engines, Google Dataset Search and Bing in particular, can index your whole published collection as a dataset. With it, the collection can appear in dataset-search results, not only in the ordinary web index.

This is distinct from the DCAT catalogue (/data/catalog). The DCAT catalogue speaks to open-data portals (CKAN, the European Data Portal); the schema.org Dataset uses the vocabulary that general web search engines read. Both describe the same offering; they target different audiences.

This descriptor is open data: it needs no API key, is read-only, covers published records only, and is open to cross-origin (CORS) requests.


The endpoints

GET /data/dataset.jsonld

Always returns the schema.org/Dataset as JSON-LD (application/ld+json). This is the URL to give a search engine.

GET /data/dataset

Content-negotiated:

Accept header                              You get
application/ld+json (or a bare request)    the JSON-LD descriptor
text/html (a browser)                      a 303 redirect to the Open Data & APIs landing page (/open-data)
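
The negotiation rules for /data/dataset can be modelled as a small function. This is a minimal sketch of the behaviour the table describes, not the platform's implementation: the name negotiate is hypothetical, and the handling of Accept values the table does not list (served as JSON-LD here) is an assumption.

```python
from typing import Optional, Tuple

JSONLD = "application/ld+json"
LANDING_PAGE = "/open-data"


def negotiate(accept: Optional[str]) -> Tuple[int, str]:
    """Model GET /data/dataset: return (status, detail).

    detail is the response media type for a 200, or the redirect
    target for a 303. Hypothetical helper, for illustration only.
    """
    # Media types named in the header, lower-cased, with parameters
    # such as ";q=0.9" dropped.
    named = [part.split(";")[0].strip().lower() for part in (accept or "").split(",")]
    if JSONLD in named:
        return 200, JSONLD
    if "text/html" in named:
        return 303, LANDING_PAGE
    # Bare request, or a media type the table does not mention
    # (assumption: treated like a bare request).
    return 200, JSONLD
```

A client that always wants the descriptor can skip negotiation entirely by requesting /data/dataset.jsonld.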

What the descriptor contains

A single schema.org/Dataset node describing the published collection:

  • name, description, url (the Open Data landing page) and a stable identifier.
  • license - Creative Commons Attribution 4.0 (CC-BY-4.0).
  • creator and publisher - your institution, taken from the platform name in settings.
  • keywords - the subject area (archives, cultural heritage, GLAM, linked open data, finding aids).
  • temporalCoverage - the date span of the collection (the earliest to the latest dated record), when dates are available.
  • spatialCoverage - the most-referenced places in the collection.
  • includedInDataCatalog - a link to the full DCAT catalogue (/data/catalog).
  • size - the number of published records.
  • distribution - one entry per downloadable form of the data (see below).
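
To make the shape concrete, the sketch below builds a descriptor with the properties listed above and prints it as JSON-LD. Every value is illustrative: the host, the institution name, the identifier, the dates, the place, the record count and the contentUrl path are assumptions, and the real descriptor may nest or format these fields differently.

```python
import json

BASE = "https://archive.example.org"  # hypothetical host

descriptor = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Archive published collection",
    "description": "All published archival descriptions held by Example Archive.",
    "url": f"{BASE}/open-data",
    "identifier": f"{BASE}/data/dataset",  # assumed form of the stable identifier
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Archive"},
    "publisher": {"@type": "Organization", "name": "Example Archive"},
    "keywords": ["archives", "cultural heritage", "GLAM",
                 "linked open data", "finding aids"],
    "temporalCoverage": "1850/1990",  # earliest/latest dated record (made-up years)
    "spatialCoverage": [{"@type": "Place", "name": "Example Town"}],
    "includedInDataCatalog": {"@type": "DataCatalog", "url": f"{BASE}/data/catalog"},
    "size": "12345 records",  # made-up count
    "distribution": [
        {
            "@type": "DataDownload",
            "name": "Bulk catalogue dump (CSV)",
            "encodingFormat": "text/csv",
            "contentUrl": f"{BASE}/hypothetical/dump.csv",  # hypothetical path
        }
    ],
}

print(json.dumps(descriptor, indent=2))
```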

Distributions (the downloads)

Each distribution is a schema.org/DataDownload with an encodingFormat and a contentUrl:

Distribution                        Format
Bulk catalogue dump (CSV)           text/csv
Bulk catalogue dump (JSON-LD)       application/ld+json
Combined CIDOC-CRM graph            text/turtle
Linked-data graph (front door)      application/ld+json
Per-record graph neighbourhood      application/ld+json, text/turtle, application/rdf+xml
OAI-PMH harvesting endpoint         text/xml
VoID / DCAT discovery               text/turtle

The distribution list is built from the platform's canonical list of open-data surfaces, so it stays in step with everything else the platform offers - add a new open surface and it appears here automatically.
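
Because the list can grow, a consumer is better off selecting distributions by encodingFormat than by position. The sketch below shows one way to do that; the helper names and the sample URLs are hypothetical, and it assumes a multi-format entry carries its formats as a list, which the real descriptor may express differently.

```python
from typing import Any, Dict, List


def formats_of(distribution: Dict[str, Any]) -> List[str]:
    """Return encodingFormat as a list, whether it is one string or several."""
    value = distribution.get("encodingFormat", [])
    return [value] if isinstance(value, str) else list(value)


def urls_for_format(descriptor: Dict[str, Any], media_type: str) -> List[str]:
    """Collect the contentUrl of every distribution offered in media_type."""
    entries = descriptor.get("distribution", [])
    if isinstance(entries, dict):  # a lone entry may not be wrapped in a list
        entries = [entries]
    return [
        entry["contentUrl"]
        for entry in entries
        if "contentUrl" in entry and media_type in formats_of(entry)
    ]


# Illustrative data only; these URLs are not real platform paths.
sample = {
    "@type": "Dataset",
    "distribution": [
        {"@type": "DataDownload", "encodingFormat": "text/csv",
         "contentUrl": "https://archive.example.org/hypothetical/dump.csv"},
        {"@type": "DataDownload",
         "encodingFormat": ["application/ld+json", "text/turtle", "application/rdf+xml"],
         "contentUrl": "https://archive.example.org/hypothetical/record-graph"},
        {"@type": "DataDownload", "encodingFormat": "text/turtle",
         "contentUrl": "https://archive.example.org/hypothetical/void.ttl"},
    ],
}

print(urls_for_format(sample, "text/turtle"))
```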


How to use it

Register the dataset with a search engine

Submit https://YOUR-HOST/data/dataset.jsonld (or the /open-data page that can embed the same markup) through Google Search Console. Google Dataset Search reads the schema.org/Dataset node and lists your collection in its dataset index.

Validate the markup

Paste the JSON-LD into Google's Rich Results Test or the Schema Markup Validator to confirm the Dataset and its DataDownload distributions are recognised.
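
Before pasting, a quick local check can catch obvious gaps. The sketch below is not a substitute for either validator: the function name precheck is hypothetical, and the fields it checks are simply the ones this article describes (a Dataset with a name and description, and distributions with an encodingFormat and a contentUrl), not an official list of search-engine requirements.

```python
from typing import Any, Dict, List


def precheck(descriptor: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []

    # In JSON-LD, @type may be a single string or a list of strings.
    types = descriptor.get("@type", [])
    if isinstance(types, str):
        types = [types]
    if "Dataset" not in types:
        problems.append("@type is not Dataset")

    for key in ("name", "description"):
        if not descriptor.get(key):
            problems.append(f"missing {key}")

    entries = descriptor.get("distribution", [])
    if isinstance(entries, dict):  # tolerate a single unwrapped entry
        entries = [entries]
    for index, entry in enumerate(entries):
        for key in ("encodingFormat", "contentUrl"):
            if not entry.get(key):
                problems.append(f"distribution[{index}] missing {key}")

    return problems
```

Calling precheck on the parsed descriptor and printing the result gives a short to-do list to clear before running the online validators.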

Fetch it programmatically

curl -H "Accept: application/ld+json" https://YOUR-HOST/data/dataset.jsonld
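
The same request from Python, using only the standard library. This is a minimal sketch: the function names are hypothetical, YOUR-HOST must be replaced with your own host name, and error handling is left out for brevity.

```python
import json
from urllib.request import Request, urlopen


def build_request(host: str) -> Request:
    """Build the GET request for the always-JSON-LD endpoint."""
    url = f"https://{host}/data/dataset.jsonld"
    return Request(url, headers={"Accept": "application/ld+json"})


def fetch_descriptor(host: str, timeout: float = 10.0) -> dict:
    """Fetch the descriptor and parse it into a dictionary."""
    with urlopen(build_request(host), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    descriptor = fetch_descriptor("YOUR-HOST")  # replace with your host
    # Optional figures may be omitted, so read them with .get().
    print(descriptor.get("name"), descriptor.get("size"))
```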

Notes

  • Read-only and resilient. The descriptor only reads cheap aggregate figures (a record count, a date span, the top places). If a figure is unavailable it is simply omitted - the descriptor never fails.
  • Stable URLs. Every address is built from the platform's own base URL, so the descriptor is correct on any host or behind any proxy.
  • Open licence. Everything is published under CC-BY-4.0 - reuse it with attribution.

See also

  • Open-Data Catalogue (DCAT) - the data-portal view of the same offering (/data/catalog).
  • Open Memory Protocol - the machine index of every open-data surface (/open-data/protocol).
  • Open graph statistics - the size-and-shape figures the dataset size is drawn from (/data/stats).
  • Bulk open-data exports - the CSV / JSON-LD dumps the distributions point at.