Properties of a High Quality OSV Record

Version

1.0.0 (SEMVER)

Purpose

Describe the “good enough” OSV record that will be imported by OSV.dev.

Out of scope

This document does not discuss record bit rot, that is, records degrading over time after a successful initial import. Continuous revalidation, and the treatment of records that have already been imported, will be dealt with separately in the companion document, Managing the Perishability of OSV Records.

Deferred to a future iteration: validating the existence of vulnerable functions in the ecosystem_specific field, if supplied.

Audience

  1. OSV record producers
  2. Downstream OSV.dev record consumers

Rationale

OSV.dev seeks to be a comprehensive, accurate and timely database of known vulnerabilities that is highly automation friendly. In order to meet this accuracy goal, a quality bar needs to be both defined and sustainably enforced.

Properties of a High Quality OSV Record

Valid

As a prerequisite, a record is assumed to pass JSON Schema validation against the version of the OSV Schema it declares in the schema_version field, or against version 1.0.0 if that field is absent. The vulnerability described in the record is also assumed to be valid and to affect the software described.
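The version fallback described above can be sketched as follows. This is a minimal illustration, assuming the record has already been parsed into a Python dict; the helper name declared_schema_version is hypothetical, and the schema validation itself is left to whichever JSON Schema validator is in use.

```python
# Version assumed when a record does not declare schema_version (per the text above).
DEFAULT_SCHEMA_VERSION = "1.0.0"


def declared_schema_version(record: dict) -> str:
    """Return the OSV Schema version the record should be validated against."""
    return record.get("schema_version", DEFAULT_SCHEMA_VERSION)


# A record without schema_version falls back to the default.
print(declared_schema_version({"id": "EXAMPLE-1"}))
print(declared_schema_version({"id": "EXAMPLE-1", "schema_version": "1.6.3"}))
```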

Precise

A high quality OSV record lets a consumer of that record answer the following questions in an automated way, at scale:

  • “Does this vulnerability, as described, impact me?”
    • “What version do I need to upgrade to, or what patches do I need to apply, for it not to impact me?”
    • “Should I replace or remove this (potentially orphaned) package with known unfixed vulnerabilities?”

The definition of “impact” will vary depending on how fine-grained the information available is (i.e. package-level or symbol-level for software library packages). Package-level precision is the minimum standard.

  • for version and commit ranges
    • affected[].ranges[].events[].introduced is defined
    • prefer affected[].ranges[].events[].fixed over affected[].ranges[].events[].last_affected
      • this minimizes false negatives
    • distinct ranges for introduced..fixed and/or introduced..last_affected (i.e. introduced and fixed versions or commits can’t be the same)
    • values in introduced are before/less than fixed/last_affected according to the canonical package registry or project version control
    • for version (ECOSYSTEM and SEMVER) ranges
      • the versions exist in the specific package ecosystem
    • for commit (GIT) ranges
      • the commits exist in the specified repo (e.g. they are not from another GitHub fork)
  • the package.ecosystem, and a unique identifier prefix for it, are defined in the OSV Schema
  • the package.name exists within the defined package.ecosystem, and is canonically encoded to be unambiguous (i.e. normalized)
  • Package URLs in the package.purl field conform to the Package URL (purl) specification
  • reference URLs return a 2xx or 3xx response at the time of publication
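Some of the range checks above can be done on the record alone, without consulting a package registry or repository. The following is a rough sketch under that limitation, assuming a single affected[].ranges[] entry parsed into a Python dict; the function name check_range_events and the wording of the messages are hypothetical. Checks that need external data (version ordering, version and commit existence, URL reachability) are deliberately left out.

```python
def check_range_events(rng: dict) -> list[str]:
    """Return the problems found in one affected[].ranges[] entry."""
    events = rng.get("events", [])
    # Each event object carries one key, such as "introduced" or "fixed".
    introduced = [e["introduced"] for e in events if "introduced" in e]
    fixed = [e["fixed"] for e in events if "fixed" in e]
    last_affected = [e["last_affected"] for e in events if "last_affected" in e]

    problems = []
    if not introduced:
        problems.append("introduced is not defined")
    if last_affected and not fixed:
        # Not an error, but fixed is preferred where a fix is known.
        problems.append("last_affected used; prefer fixed when a fix exists")
    if set(introduced) & set(fixed):
        problems.append("introduced and fixed are the same")
    return problems


ok = {"type": "SEMVER", "events": [{"introduced": "1.0.0"}, {"fixed": "1.2.3"}]}
empty = {"type": "SEMVER", "events": [{"introduced": "1.2.3"}, {"fixed": "1.2.3"}]}
print(check_range_events(ok))     # []
print(check_range_events(empty))  # ['introduced and fixed are the same']
```

Returning a list of messages rather than a boolean lets a producer report every problem in a range at once.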

Identifiable

  • Where relevant, an alias to the equivalent CVE record is present
  • Where an OSV record consolidates multiple vulnerabilities in another ecosystem (or universe), multiple related identifiers are present
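One way to check for these identifiers is to look for CVE IDs in the aliases and related fields. The sketch below assumes a record parsed into a Python dict; the helper name cve_ids is hypothetical and the identifiers in the example are placeholders, not real records.

```python
import re

# CVE IDs have the form CVE-<4-digit year>-<4 or more digits>.
CVE_ID = re.compile(r"CVE-\d{4}-\d{4,}")


def cve_ids(record: dict, field: str) -> list[str]:
    """Return the CVE IDs listed in the given field ("aliases" or "related")."""
    return [i for i in record.get(field, []) if CVE_ID.fullmatch(i)]


# Placeholder record for illustration only.
record = {
    "id": "EXAMPLE-2099-1",
    "aliases": ["CVE-2099-0001"],
    "related": ["CVE-2099-0002", "CVE-2099-0003", "EXAMPLE-2099-2"],
}
print(cve_ids(record, "aliases"))  # ['CVE-2099-0001']
print(cve_ids(record, "related"))  # ['CVE-2099-0002', 'CVE-2099-0003']
```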

Examples

  • GO-2024-2687
    • Has introduced and fixed versions
    • Has an alias to a CVE record ID
    • Has a purl
  • OSV-2024-98
    • Has introduced and fixed commits
      • commits exist in repo
  • DSA-5678-1
    • Has introduced and fixed versions
    • Has multiple related CVE record IDs

Appendix A: OSV Schema validation

(As at version 1.6.3, generated by Gemini from the OSV JSON schema)

Required top-level fields:

  • id: A unique string identifier for the vulnerability.
  • modified: A timestamp (in RFC3339 format, in UTC, ending in “Z”) indicating when the vulnerability information was last updated.

Optional, but validated when present:

  • schema_version: A string specifying the version of the schema being used.
  • published/withdrawn: Timestamps (in RFC3339 format, in UTC, ending in “Z”) for when the vulnerability was published or withdrawn.
  • aliases/related: Arrays of strings for alternate identifiers or related vulnerabilities.
  • summary/details: String descriptions of the vulnerability.
  • severity: An array of objects detailing the severity using different scoring systems (e.g., CVSS v2, v3, or v4), if available.
  • affected: An array of objects describing which packages are affected, including details like:
    • package: The ecosystem (e.g., npm, PyPI), name, and Package URL (PURL) of the affected package.
    • severity: Severity for the specific package (if different from the overall severity).
    • ranges: Information on the affected version ranges, commit ranges, or ecosystem-specific identifiers.
    • versions: A list of specific affected versions.
    • ecosystem_specific/database_specific: Additional data specific to the package ecosystem or the vulnerability database.
  • references: An array of objects providing URLs to external resources about the vulnerability, categorized by type (e.g., advisory, article, discussion).
  • credits: An array of objects giving credit to individuals or organizations involved in discovering, reporting, or fixing the vulnerability.
  • database_specific: A flexible object for any extra information specific to the database using this schema.

Additional Validation Rules:

  • timestamp: A custom definition that ensures timestamps adhere to the RFC3339 date-time format (e.g., “2023-11-15T12:34:56Z”).
  • additionalProperties: false: This prevents any extra properties from being added to the JSON object beyond those defined in the schema.
  • Specific requirements in the affected array:
    • There are conditional validations based on the type of range, ensuring the correct properties are present (e.g., repo is required when type is GIT).
    • A logical check ensures that if last_affected is specified in events, then fixed cannot be present in the same events array.
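
The timestamp and range rules above can be approximated outside a JSON Schema validator as follows. This is an illustrative sketch only: the regular expression is an approximation of the schema's timestamp definition, and the function names is_osv_timestamp and range_rule_violations are hypothetical.

```python
import re

# RFC 3339 date-time in UTC, ending in "Z", with optional fractional seconds.
# Assumption: this approximates the schema's timestamp definition.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z")


def is_osv_timestamp(value: str) -> bool:
    """Check the shape of a timestamp (not whether the date itself is valid)."""
    return TIMESTAMP.fullmatch(value) is not None


def range_rule_violations(rng: dict) -> list[str]:
    """Apply the two affected[].ranges[] rules described above to one range."""
    violations = []
    if rng.get("type") == "GIT" and "repo" not in rng:
        violations.append("repo is required when type is GIT")
    # Collect every event key used in this range's events array.
    keys = {key for event in rng.get("events", []) for key in event}
    if {"fixed", "last_affected"} <= keys:
        violations.append("fixed and last_affected in the same events array")
    return violations


print(is_osv_timestamp("2023-11-15T12:34:56Z"))       # True
print(is_osv_timestamp("2023-11-15T12:34:56+01:00"))  # False
```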