A schema.org library for Haskell

This blog post announces a schema.org library for Haskell: schema-dot-org.

The schema.org specification lets users annotate web pages with structured data representing the contents of the web page. It contains over 2000 schemas with a very loosely defined structure, posing an interesting challenge for a Haskell library.

Quick intro to schema.org

The schema.org specification defines schemas such as the Event schema. These schemas form a multiple-inheritance hierarchy that defines which attributes one may find in an object of a given type.

That's about as structured as it gets. An object can appear in several formats: as Microdata, as RDFa, as JSON-LD, or in other formats still.

Every attribute of an object is optional. Furthermore, every attribute might appear multiple times, or even appear as a list.
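To make that concrete, here is a minimal sketch of how a consumer might collapse "absent, once, several times, or as a list" into a single list of values. The Value type and the attributeValues function are hypothetical stand-ins invented for this illustration; they are not part of schema-dot-org, and a real consumer would use a proper JSON library instead.

```haskell
-- A tiny stand-in for a JSON value (an assumption for this sketch).
data Value
  = VString String
  | VArray [Value]
  | VObject [(String, Value)]
  deriving (Show, Eq)

-- | Collect every value of an attribute, whether the attribute is absent,
-- appears once, appears several times, or appears as a list.
attributeValues :: String -> Value -> [Value]
attributeValues key (VObject fields) =
  concatMap flatten [v | (k, v) <- fields, k == key]
  where
    -- A list contributes each of its elements; anything else contributes itself.
    flatten (VArray vs) = vs
    flatten v = [v]
-- Anything that is not an object has no attributes.
attributeValues _ _ = []
```

With this shape, an absent attribute is simply the empty list, so the caller decides what counts as required.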

The Social Dance Today website uses structured data in the form of JSON-LD objects to mark up event pages. Here is one example:

{
    "@context": "https://schema.org",
    "@type": "DanceEvent",
    "description": "Welcome to our student parties at Rhythmia!\nWe play salsa and bachata all night long!\nFree for all our students and their friends, CHF 10 for guests",
    "eventAttendanceMode": "https://schema.org/OfflineEventAttendanceMode",
    "eventStatus": "https://schema.org/EventScheduled",
    "image": "https://social-dance.today/image/ikvp2PJDE2lIgxgNQi4YUJLBt2zFEfTIxeGQnEo1N+A=",
    "location": {
        "@context": "https://schema.org",
        "@type": "Place",
        "address": "Badenerstrasse 551",
        "geo": {
            "@context": "https://schema.org",
            "@type": "GeoCoordinates",
            "latitude": "47.37432",
            "longitude": "8.52315"
        }
    },
    "name": "Rhythmia Student Party",
    "organizer": {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": "Rhythmia Salsa & Bachata Studio",
        "url": "https://social-dance.today/organiser/rhythmia-salsa-bachata-studio"
    },
    "startDate": "2023-03-10T21:00:00",
    "url": "https://social-dance.today/party/rhythmia-salsa-bachata-studio/rhythmia-student-party/2023-03-10"
}

The Event schema specifies which data one might put into an object like this, or which data one might expect to find in an object like this when parsing.

Use-cases

The schema.org specification is so loosely structured that it becomes important to distinguish between the two typical use-cases for a library like this: producers and consumers.

Producers want to be as strict as they can be about producing exactly the data they want in the exact right format. They want to make sure all the timestamps are in a standardised format, and that all the data that they have to offer is definitely rendered.
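As a rough illustration of the producer side, a strict renderer can accept only a well-typed value and emit exactly one format. This sketch uses the time library; renderStartDate and exampleStart are hypothetical names chosen for this example, not functions from schema-dot-org.

```haskell
import Data.Time
  ( LocalTime (..)
  , TimeOfDay (..)
  , defaultTimeLocale
  , formatTime
  , fromGregorian
  )

-- | Render a local timestamp in a single, fixed ISO 8601 style format.
-- The producer never has to think about alternative spellings.
renderStartDate :: LocalTime -> String
renderStartDate = formatTime defaultTimeLocale "%Y-%m-%dT%H:%M:%S"

-- | The start time used in the JSON-LD example above.
exampleStart :: LocalTime
exampleStart = LocalTime (fromGregorian 2023 3 10) (TimeOfDay 21 0 0)
```

Here renderStartDate exampleStart produces "2023-03-10T21:00:00", the same shape as the startDate field in the example.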

Consumers, on the other hand, want to be as lenient as they can get away with when consuming structured data like this. They want to treat as few attributes as possible as required, and they need to be able to deal with any timestamp format that the specification allows. They also need to handle every possible value that the schema specifies.
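A lenient consumer might try several formats in turn and keep whichever one parses. The sketch below, again using the time library, accepts either a date with a time or a bare date. DateOrDateTime and parseLenient are hypothetical names for this illustration, and the sketch deliberately leaves out time zone offsets and other variants that a real consumer would also need to accept.

```haskell
import Control.Applicative ((<|>))
import Data.Time (Day, LocalTime, defaultTimeLocale, parseTimeM)

-- | What a lenient consumer might end up with after parsing a date attribute.
data DateOrDateTime
  = JustDate Day
  | LocalDateTime LocalTime
  deriving (Show, Eq)

-- | Try the more specific format first, then fall back to a bare date.
-- The 'True' argument lets the parser accept surrounding whitespace.
parseLenient :: String -> Maybe DateOrDateTime
parseLenient s =
  LocalDateTime <$> parseTimeM True defaultTimeLocale "%Y-%m-%dT%H:%M:%S" s
    <|> JustDate <$> parseTimeM True defaultTimeLocale "%Y-%m-%d" s
```

Returning Maybe keeps the decision about unparseable values with the caller, who can skip the attribute rather than reject the whole object.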

Schema.org in Haskell

A naive way to go about writing a Haskell library for these specifications would be to generate a data type per schema, and generate parsing and rendering code. However, the size and number of schemas render this infeasible. We would end up with a library that contains hundreds of thousands of lines of code, and it would be a bother to deal with as a user. It would also likely not adequately acknowledge the different use-cases above.

So instead we have opted to generate as little code as possible. This code essentially only serves as evidence of the specification. On top of that code, we have then written two little libraries: one for producing data and one for consuming data.
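One way to picture "code as evidence of the specification" is with phantom types: a generated value records only which schema an attribute belongs to and what kind of value it expects, with no data type per schema. Every name in this sketch (Thing, Event, Property, thingName, eventStartDate) is a hypothetical illustration and should not be read as the actual API of schema-dot-org.

```haskell
-- Hypothetical names throughout; an illustration, not the library's API.

-- | Empty types standing in for schemas. They carry no data at all.
data Thing
data Event

-- | An attribute name tagged, at the type level only, with the schema it
-- belongs to and the type of value it is expected to hold.
newtype Property schema value = Property {propertyName :: String}

-- | Evidence that the Thing schema has a "name" attribute.
thingName :: Property Thing String
thingName = Property "name"

-- | Evidence that the Event schema has a "startDate" attribute.
eventStartDate :: Property Event String
eventStartDate = Property "startDate"
```

A design along these lines costs one small definition per attribute, and separate producing and consuming code can interpret the same evidence strictly or leniently.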

Common generated library

The generated library

Producing
