Schema.org, BioSchemas, JSONSchema, JSON-LD and DATS¶

A KidsFirst case study in serializing asset metadata as DATS and validating it with JSON Schema.

Version: 1.0

License: GPLv2+

Motivations¶

The Data Tag Suite (DATS) metadata model, as described in this paper and fully codified in this repository, strives to model datasets irrespective of their domains. DATS embodies several key elements that make it extraordinarily useful for FAIRification:

Machine Readability: Datasets described in a consistent DATS format permit machines to resolve FAIR metadata such as identifiers, authorship, funding, citation, license, consent, access, provenance, and ultimately topic as well.
RDF Interoperability: Serialized in strict JSON-LD, the DATS format is renderable as an RDF graph, permitting interoperability with ontological vocabularies and existing dataset description formats including schema.org and the Open Biological and Biomedical Ontology (OBO).
Findability: Using these consistent formats lets existing and future services identify aspects of the dataset for indexing and searching. One such service is Google Dataset Search, which uses schema.org metadata.
CFDE Compatibility: Tooling has been created to convert DATS to the C2M2 and to automatically evaluate the FAIRness of datasets through the DATS metadata model.

Ingredients¶

Access to a manifest or API for serving your existing datasets.

Objectives¶

Convert your existing metadata into the DATS metadata schema
Check the validity of your DATS metadata

Preparation¶

We need to get the manifest, or access to an API, serving the existing metadata. In our case study, KidsFirst, the assets are browsable in the file repository. After enabling all “Columns”, click the “Export TSV” button and save that file to ./data/file-table.tsv.

# Python tool for data table processing
import pandas as pd
# Jupyter Notebook display helper
from IPython.display import display

df = pd.read_csv('./data/file-table.tsv', sep='\t', low_memory=False)
display(df.head())

	File ID	Participants ID	Study Name	Proband	Family Id	Data Type	File Format	File Size	Participant External ID	File Name	File External ID	Aliquot External ID	Sample External ID	Biospecimen ID	Tissue Type (Source Text)	Diagnosis (Source Text)	Study ID	Latest DID	Observed	Repository
0	GF_000VDK42	PT_PR4YBBH3	Pediatric Brain Tumor Atlas: CBTTC	Yes	--	Aligned Reads	bam	11041645308	C29274	62c0c6fe-99f8-4ff7-b3b5-233e6cc2ff0f.bam	62c0c6fe-99f8-4ff7-b3b5-233e6cc2ff0f.bam	746063	7316-126-T-112502.RNA-Seq	BS_A7Q8G0Y1	Tumor	Brainstem glioma- Diffuse intrinsic pontine gl...	SD_BHJXBDQK	bdf6c2f6-1500-4693-ae32-fd18dc4ab9e1	--	gen3
1	GF_000WBJCD	PT_NK8A49X5	Pediatric Brain Tumor Atlas: PNOC	Yes	--	Annotated Somatic Mutations	maf	122552	P-06	77af1324-3754-4e34-a208-d1342a2f2ca6.mutect2_s...	harmonized/simple-variants/77af1324-3754-4e34-...	A08713, A08710	A08692-T.WXS, A08691-N.WXS	BS_6DT506HY, BS_5DPMQQVG	Tumor, Normal	Brainstem glioma- Diffuse intrinsic pontine gl...	SD_M3DBXD12	16712090-50f7-4cd1-bf2d-90ce989c2139	--	gen3
2	GF_001JWT9N	PT_G16VK7FR	Pediatric Brain Tumor Atlas: PNOC	Yes	--	Gene Fusions	pdf	5779507	P-37	9f586dc1-1df8-4f59-9a01-803141fffb94.arriba.fu...	harmonized/gene-fusions/9f586dc1-1df8-4f59-9a0...	A19683	A19649-T.RNA-Seq	BS_XGDPK33A	Tumor	Brainstem glioma- Diffuse intrinsic pontine gl...	SD_M3DBXD12	49f7aead-ce23-4c38-b566-1d99cb5a5435	--	gen3
3	GF_002DRSGP	PT_2HN13G42	Kids First: Congenital Diaphragmatic Hernia	Yes	FM_F9S808PW	Aligned Reads	cram	23537329440	CDH14-0006	CDH14-0006.cram	s3://kf-study-us-east-1-prd-sd-46sk55a3/source...	CDH14-0006	CDH14-0006	BS_BKH9S8YN	Normal	congential diaphragmatic hernia	SD_46SK55A3	c4c9d542-21fb-487d-ac07-916d466774a8	true, false, false, false	gen3
4	GF_004J173A	PT_MH56TZJD	Kids First: Congenital Diaphragmatic Hernia	No	FM_1EFM6M40	Aligned Reads	cram	17290282717	CDH4-84F	CDH4-84F.cram	s3://kf-study-us-east-1-prd-sd-46sk55a3/source...	CDH4-84F	CDH4-84F	BS_MP5P3ZPH	Normal	--	SD_46SK55A3	02642bc8-ae46-45a1-8932-2765cf1df480	--	gen3

DATS Conversion¶

The full DATS schema is available here; it includes a JSON Schema definition as well as a visualization of how things fit together.

The root schema for datasets is: .

DATS uses a strict JSON-LD serialization.

There are several ways to get a sense of what the metadata model entails. In some cases starting with an example is easier, but everything is easier with autocompletion and type hints. Several code editors support JSON Schema for autocompletion (see this).

With Visual Studio Code, you can set this up by linking to the schema with a $schema field.

%%sh
# Create a file DATS-Validation.json with the following contents
#  this is a simple json-schema which references the public DATS schema validator
cat > DATS-Validation.json << EOF
{
  "\$schema": "http://json-schema.org/draft-04/schema",
  "type": "object",
  "properties": {
    "dats": {
      "\$ref": "https://raw.githubusercontent.com/datatagsuite/schema/master/dataset_schema.json"
    }
  }
}
EOF

# Create a file to edit which will validate against the file from the DATS-Validation file
cat > my-dats.json << EOF
{
  "\$schema": "./DATS-Validation.json",
  "dats": {
    "title": "My First Json-Schema Validated DATS Object"
  }
}
EOF

Modifying the created my-dats.json, you should be able to explore the fields through autocompletion with an editor that supports it.

Validation hints of missing properties

Auto completion for property fields

Another way, or an additional one, is to learn by example. Several other DCCs’ assets were processed and converted to DATS here; example files and scripts can be found in the DCC_name/output and DCC_name/scripts directories, respectively.

It’s now time to convert what we can into DATS, striving to capture as much as possible from the original table.

Challenge 1: What do you mean by Dataset?¶

Even in this case, the definition of a Dataset becomes problematic and unclear. Remember that the things we codify are models and, as such, are not always perfect. Rather than thinking about a Dataset in terms of your own interpretation of what it is, think of it in terms of how it will end up being used.

This is what a ‘dataset’ looks like on Google Dataset Search; the same fields, and more, will be used for your own assets.

Google Dataset Search

In other words, irrespective of what your definition of a ‘dataset’ is, you should consider using something that is identifiable enough to have its own unique metadata, including a dedicated landing page, unique identifier, citation, license, and more. File assets associated with that dataset will be listed under the dataset.

Importantly, Datasets should ideally be associated with a single biosample when possible. In some cases it may therefore make sense to treat each individual file as its own dataset, if each file is in fact produced per biosample.

Do note that DATS also supports Dataset in Dataset relationships if that becomes necessary.
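As a rough illustration of such nesting, the sketch below groups per-file Dataset objects under one parent Dataset per study. It assumes that `hasPart` is the DATS property for Dataset-in-Dataset relationships; check this against the dataset schema before relying on it. The helper names `file_dataset` and `nest_by_study` and the identifiers are hypothetical, and the parent objects omit other fields a complete Dataset would need.

```python
from collections import defaultdict

def file_dataset(file_id, study_id):
  """Build a minimal, hand-written per-file Dataset (hypothetical example data)."""
  return {
    '@type': 'Dataset',
    'identifier': {'@type': 'Identifier', 'identifier': file_id},
    'producedBy': {
      '@type': 'Study',
      'identifier': {'@type': 'Identifier', 'identifier': study_id},
    },
  }

def nest_by_study(file_datasets):
  """Group per-file Datasets under one parent Dataset per study identifier."""
  by_study = defaultdict(list)
  for dataset in file_datasets:
    study_id = dataset['producedBy']['identifier']['identifier']
    by_study[study_id].append(dataset)
  return [
    {
      '@type': 'Dataset',
      'identifier': {'@type': 'Identifier', 'identifier': study_id},
      # ASSUMPTION: `hasPart` is the DATS property that holds nested Datasets
      'hasPart': children,
    }
    for study_id, children in by_study.items()
  ]

parents = nest_by_study([
  file_dataset('file-1', 'study-A'),
  file_dataset('file-2', 'study-A'),
  file_dataset('file-3', 'study-B'),
])
```

Whether to nest at all remains a modelling decision: the per-file layout used in the rest of this walkthrough keeps each Dataset tied to its own biospecimen.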

# The KidsFirst table has four primary entity types in this file plus a unique identifier (Latest DID)
display(df[['File ID', 'Participants ID', 'Study ID', 'Biospecimen ID', 'Latest DID']].head())

# Using JSON-LD and keeping in mind the arbitrary DATS structure,
#  things should end up looking like so:
def dats_from_record(record):
  return {
    '@type': 'Dataset',
    'identifier': {
      '@type': 'Identifier',
      'identifier': record['Latest DID'],
    },
    'producedBy': {
      # The dataset in question was produced as part of a study
      '@type': 'Study',
      'identifier': {
        '@type': 'Identifier',
        'identifier': record['Study ID'],
      },
    },
    'isAbout': [
      {
        # The dataset in question has a biospecimen
        '@type': 'BiologicalEntity',
        'identifier': {
          '@type': 'Identifier',
          'identifier': record['Biospecimen ID'],
        },
      },
      {
        # The dataset in question is about this participant
        '@type': 'StudyGroup',
        'identifier': {
          '@type': 'Identifier',
          'identifier': record['Participants ID'],
        },
      },
    ],
    'distributions': [
      {
        # The dataset in question has this file
        '@type': 'DatasetDistribution',
        'identifier': {
          '@type': 'Identifier',
          'identifier': record['File ID'],
        },
      }
    ],
  }

# Converting each element to DATS
dats = {
  # schema.org context, gives RDF meaning to `@type` and `predicates` as defined by schema.org
  '@context': 'http://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld',
  '@graph': [
    dats_from_record(record)
    for _, record in df.head().iterrows()
  ]
}
display(dats)

	File ID	Participants ID	Study ID	Biospecimen ID	Latest DID
0	GF_000VDK42	PT_PR4YBBH3	SD_BHJXBDQK	BS_A7Q8G0Y1	bdf6c2f6-1500-4693-ae32-fd18dc4ab9e1
1	GF_000WBJCD	PT_NK8A49X5	SD_M3DBXD12	BS_6DT506HY, BS_5DPMQQVG	16712090-50f7-4cd1-bf2d-90ce989c2139
2	GF_001JWT9N	PT_G16VK7FR	SD_M3DBXD12	BS_XGDPK33A	49f7aead-ce23-4c38-b566-1d99cb5a5435
3	GF_002DRSGP	PT_2HN13G42	SD_46SK55A3	BS_BKH9S8YN	c4c9d542-21fb-487d-ac07-916d466774a8
4	GF_004J173A	PT_MH56TZJD	SD_46SK55A3	BS_MP5P3ZPH	02642bc8-ae46-45a1-8932-2765cf1df480

{'@context': 'http://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld',
 '@graph': [{'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': 'bdf6c2f6-1500-4693-ae32-fd18dc4ab9e1'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier', 'identifier': 'SD_BHJXBDQK'}},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier', 'identifier': 'BS_A7Q8G0Y1'}},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier', 'identifier': 'PT_PR4YBBH3'}}],
   'distributions': [{'@type': 'DatasetDistribution',
     'identifier': {'@type': 'Identifier', 'identifier': 'GF_000VDK42'}}]},
  {'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': '16712090-50f7-4cd1-bf2d-90ce989c2139'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier', 'identifier': 'SD_M3DBXD12'}},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier',
      'identifier': 'BS_6DT506HY, BS_5DPMQQVG'}},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier', 'identifier': 'PT_NK8A49X5'}}],
   'distributions': [{'@type': 'DatasetDistribution',
     'identifier': {'@type': 'Identifier', 'identifier': 'GF_000WBJCD'}}]},
  {'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': '49f7aead-ce23-4c38-b566-1d99cb5a5435'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier', 'identifier': 'SD_M3DBXD12'}},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier', 'identifier': 'BS_XGDPK33A'}},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier', 'identifier': 'PT_G16VK7FR'}}],
   'distributions': [{'@type': 'DatasetDistribution',
     'identifier': {'@type': 'Identifier', 'identifier': 'GF_001JWT9N'}}]},
  {'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': 'c4c9d542-21fb-487d-ac07-916d466774a8'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier', 'identifier': 'SD_46SK55A3'}},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier', 'identifier': 'BS_BKH9S8YN'}},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier', 'identifier': 'PT_2HN13G42'}}],
   'distributions': [{'@type': 'DatasetDistribution',
     'identifier': {'@type': 'Identifier', 'identifier': 'GF_002DRSGP'}}]},
  {'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': '02642bc8-ae46-45a1-8932-2765cf1df480'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier', 'identifier': 'SD_46SK55A3'}},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier', 'identifier': 'BS_MP5P3ZPH'}},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier', 'identifier': 'PT_MH56TZJD'}}],
   'distributions': [{'@type': 'DatasetDistribution',
     'identifier': {'@type': 'Identifier', 'identifier': 'GF_004J173A'}}]}]}
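In the output above, the second record's Biospecimen ID is the single string 'BS_6DT506HY, BS_5DPMQQVG', so two biospecimens end up sharing one identifier. A minimal sketch of one way to handle this, assuming the comma is only ever a separator in identifier columns (free-text columns such as the diagnosis may legitimately contain commas), is to split the cell and emit one BiologicalEntity per value. The helper names `split_multi` and `biospecimens_from_record` are hypothetical.

```python
def split_multi(value, sep=','):
  """Split a multi-valued cell into trimmed values, dropping empty and '--' entries."""
  parts = (part.strip() for part in str(value).split(sep))
  return [part for part in parts if part and part != '--']

def biospecimens_from_record(record):
  """Emit one BiologicalEntity per biospecimen identifier found in the record."""
  return [
    {
      '@type': 'BiologicalEntity',
      'identifier': {'@type': 'Identifier', 'identifier': biospecimen_id},
    }
    for biospecimen_id in split_multi(record['Biospecimen ID'])
  ]

# A plain dict stands in for a pandas row here
record = {'Biospecimen ID': 'BS_6DT506HY, BS_5DPMQQVG'}
entities = biospecimens_from_record(record)
```

The same splitting could be applied to the sample and aliquot columns, but pairing them position by position with the biospecimen IDs would be a further assumption about the export that this sketch does not make.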

There are several improvements we can make to the above:

Give context to our identifiers, which only make sense in the context of KidsFirst
Provide more metadata as available in our table
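For the first improvement, the notes in the code that follows treat `identifierSource` as a URL prefix such that source plus identifier ideally resolves to a landing page. A small sketch of that convention is shown here; the helper name `landing_url` is hypothetical, and whether the resulting URL actually resolves depends on the portal.

```python
def landing_url(identifier):
  """Join identifierSource and identifier into a candidate landing-page URL."""
  source = identifier.get('identifierSource', '')
  return source + identifier['identifier']

file_identifier = {
  '@type': 'Identifier',
  'identifier': 'GF_000VDK42',
  'identifierSource': 'https://portal.kidsfirstdrc.org/file/',
}
url = landing_url(file_identifier)
```

Keeping the prefix in `identifierSource` rather than baking it into `identifier` leaves the bare identifier usable on its own.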

def dats_from_record(record):
  return {
    '@type': 'Dataset',
    'identifier': {
      '@type': 'Identifier',
      'identifier': record['Latest DID'],
      'identifierSource': 'https://portal.kidsfirstdrc.org/',
    },
    'storedIn': {
      '@type': 'DataRepository',
      'name': record['Repository'],
    },
    'producedBy': {
      '@type': 'Study',
      'identifier': {
        '@type': 'Identifier',
        'identifier': record['Study ID'],
        'identifierSource': 'https://portal.kidsfirstdrc.org/',
      },
      'name': record['Study Name'],
    },
    'isAbout': [
      {
        '@type': 'BiologicalEntity',
        'identifier': {
          '@type': 'Identifier',
          'identifier': record['Biospecimen ID'],
          'identifierSource': 'https://portal.kidsfirstdrc.org/',
        },
        'alternateIdentifiers': [
          {
            '@type': 'AlternateIdentifier',
            'identifier': record['Sample External ID'],
            # NOTE: Preferred identifierSource with globally unique semantic URI
          },
          {
            '@type': 'AlternateIdentifier',
            'identifier': record['Aliquot External ID'],
            # NOTE: Preferred identifierSource with globally unique semantic URI
          },
        ],
      },
      {
        '@type': 'StudyGroup',
        'identifier': {
          # NOTE: Ideally, `${identifierSource}${identifier}` resolves to a landing page for this entity
          '@type': 'Identifier',
          'identifier': record['Participants ID'],
          'identifierSource': 'https://portal.kidsfirstdrc.org/participant/',
        },
        'alternateIdentifiers': [
          {
            '@type': 'AlternateIdentifier',
            'identifier': record['Participant External ID'],
            # NOTE: Preferred identifierSource with globally unique semantic URI
          },
        ],
      },
    ],
    'distributions': [
      {
        'identifier': {
          # NOTE: Ideally, `${identifierSource}${identifier}` resolves to a landing page for this entity
          '@type': 'Identifier',
          'identifier': record['File ID'],
          'identifierSource': 'https://portal.kidsfirstdrc.org/file/',
        },
        '@type': 'DatasetDistribution',
        'formats': [
          record['File Format'],
        ],
        'size': record['File Size'],
        'unit': {
          '@type': 'Annotation',
          'value': 'bytes',
          # NOTE: Preferred valueIRI with globally unique semantic URI
        },
        'access': {
          '@type': 'Access',
          'identifier': {
            '@type': 'Identifier',
            'identifier': record['File Name'],
          },
          'alternateIdentifiers': [
            {
              '@type': 'AlternateIdentifier',
              'identifier': record['File External ID'],
              # NOTE: Preferred identifierSource with globally unique semantic URI
            },
          ],
          'landingPage': 'https://portal.kidsfirstdrc.org/file/' + record['File ID'],
          # NOTE: Ideally accessURL would be specified
        }
      }
    ],
    'types': [
      {
        '@type': 'DataType',
        'information': {
          '@type': 'Annotation',
          'value': record['Data Type'],
        },
      },
    ],
    'extraProperties': [*filter(None, [
      # Metadata that doesn't fit anywhere else in DATS but may be relevant
      {
        '@type': 'CategoryValuesPair',
        'category': 'tissue',
        # NOTE: Preferred categoryIRI with globally unique semantic URI
        'values': [
          {
            '@type': 'Annotation',
            'value': record['Tissue Type (Source Text)'],
            # NOTE: Preferred valueIRI with globally unique semantic URI
          }
        ]
      } if record['Tissue Type (Source Text)'] != '--' else None,
      {
        '@type': 'CategoryValuesPair',
        'category': 'diagnosis',
        # NOTE: Preferred categoryIRI with globally unique semantic URI
        'values': [
          {
            '@type': 'Annotation',
            'value': record['Diagnosis (Source Text)'],
            # NOTE: Preferred valueIRI with globally unique semantic URI
          }
        ]
      } if record['Diagnosis (Source Text)'] != '--' else None, # Don't create entries for junk
      {
        '@type': 'CategoryValuesPair',
        'category': 'proband',
        # NOTE: Preferred categoryIRI with globally unique semantic URI
        'values': [
          {
            '@type': 'Annotation',
            'value': record['Proband'],
            # NOTE: Preferred valueIRI with globally unique semantic URI
          },
        ],
      } if record['Proband'] != '--' else None, # Don't create entries for junk,
    ])],
  }

# Converting each element to DATS
dats = {
  # schema.org context, gives RDF meaning to `@type` and `predicates` as defined by schema.org
  '@context': 'http://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld',
  '@graph': [
    dats_from_record(record)
    for _, record in df.head().iterrows()
  ]
}
display(dats)

{'@context': 'http://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld',
 '@graph': [{'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': 'bdf6c2f6-1500-4693-ae32-fd18dc4ab9e1',
    'identifierSource': 'https://portal.kidsfirstdrc.org/'},
   'storedIn': {'@type': 'DataRepository', 'name': 'gen3'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier',
     'identifier': 'SD_BHJXBDQK',
     'identifierSource': 'https://portal.kidsfirstdrc.org/'},
    'name': 'Pediatric Brain Tumor Atlas: CBTTC'},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier',
      'identifier': 'BS_A7Q8G0Y1',
      'identifierSource': 'https://portal.kidsfirstdrc.org/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': '7316-126-T-112502.RNA-Seq'},
      {'@type': 'AlternateIdentifier', 'identifier': '746063'}]},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier',
      'identifier': 'PT_PR4YBBH3',
      'identifierSource': 'https://portal.kidsfirstdrc.org/participant/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'C29274'}]}],
   'distributions': [{'identifier': {'@type': 'Identifier',
      'identifier': 'GF_000VDK42',
      'identifierSource': 'https://portal.kidsfirstdrc.org/file/'},
     '@type': 'DatasetDistribution',
     'formats': ['bam'],
     'size': 11041645308,
     'unit': {'@type': 'Annotation', 'value': 'bytes'},
     'access': {'@type': 'Access',
      'identifier': {'@type': 'Identifier',
       'identifier': '62c0c6fe-99f8-4ff7-b3b5-233e6cc2ff0f.bam'},
      'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
        'identifier': '62c0c6fe-99f8-4ff7-b3b5-233e6cc2ff0f.bam'}],
      'landingPage': 'https://portal.kidsfirstdrc.org/file/GF_000VDK42'}}],
   'types': [{'@type': 'DataType',
     'information': {'@type': 'Annotation', 'value': 'Aligned Reads'}}],
   'extraProperties': [{'@type': 'CategoryValuesPair',
     'category': 'tissue',
     'values': [{'@type': 'Annotation', 'value': 'Tumor'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'diagnosis',
     'values': [{'@type': 'Annotation',
       'value': 'Brainstem glioma- Diffuse intrinsic pontine glioma, Brainstem glioma- Diffuse intrinsic pontine glioma'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'proband',
     'values': [{'@type': 'Annotation', 'value': 'Yes'}]}]},
  {'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': '16712090-50f7-4cd1-bf2d-90ce989c2139',
    'identifierSource': 'https://portal.kidsfirstdrc.org/'},
   'storedIn': {'@type': 'DataRepository', 'name': 'gen3'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier',
     'identifier': 'SD_M3DBXD12',
     'identifierSource': 'https://portal.kidsfirstdrc.org/'},
    'name': 'Pediatric Brain Tumor Atlas: PNOC'},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier',
      'identifier': 'BS_6DT506HY, BS_5DPMQQVG',
      'identifierSource': 'https://portal.kidsfirstdrc.org/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'A08692-T.WXS, A08691-N.WXS'},
      {'@type': 'AlternateIdentifier', 'identifier': 'A08713, A08710'}]},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier',
      'identifier': 'PT_NK8A49X5',
      'identifierSource': 'https://portal.kidsfirstdrc.org/participant/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'P-06'}]}],
   'distributions': [{'identifier': {'@type': 'Identifier',
      'identifier': 'GF_000WBJCD',
      'identifierSource': 'https://portal.kidsfirstdrc.org/file/'},
     '@type': 'DatasetDistribution',
     'formats': ['maf'],
     'size': 122552,
     'unit': {'@type': 'Annotation', 'value': 'bytes'},
     'access': {'@type': 'Access',
      'identifier': {'@type': 'Identifier',
       'identifier': '77af1324-3754-4e34-a208-d1342a2f2ca6.mutect2_somatic.vep.maf'},
      'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
        'identifier': 'harmonized/simple-variants/77af1324-3754-4e34-a208-d1342a2f2ca6.mutect2_somatic.vep.maf'}],
      'landingPage': 'https://portal.kidsfirstdrc.org/file/GF_000WBJCD'}}],
   'types': [{'@type': 'DataType',
     'information': {'@type': 'Annotation',
      'value': 'Annotated Somatic Mutations'}}],
   'extraProperties': [{'@type': 'CategoryValuesPair',
     'category': 'tissue',
     'values': [{'@type': 'Annotation', 'value': 'Tumor, Normal'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'diagnosis',
     'values': [{'@type': 'Annotation',
       'value': 'Brainstem glioma- Diffuse intrinsic pontine glioma'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'proband',
     'values': [{'@type': 'Annotation', 'value': 'Yes'}]}]},
  {'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': '49f7aead-ce23-4c38-b566-1d99cb5a5435',
    'identifierSource': 'https://portal.kidsfirstdrc.org/'},
   'storedIn': {'@type': 'DataRepository', 'name': 'gen3'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier',
     'identifier': 'SD_M3DBXD12',
     'identifierSource': 'https://portal.kidsfirstdrc.org/'},
    'name': 'Pediatric Brain Tumor Atlas: PNOC'},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier',
      'identifier': 'BS_XGDPK33A',
      'identifierSource': 'https://portal.kidsfirstdrc.org/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'A19649-T.RNA-Seq'},
      {'@type': 'AlternateIdentifier', 'identifier': 'A19683'}]},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier',
      'identifier': 'PT_G16VK7FR',
      'identifierSource': 'https://portal.kidsfirstdrc.org/participant/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'P-37'}]}],
   'distributions': [{'identifier': {'@type': 'Identifier',
      'identifier': 'GF_001JWT9N',
      'identifierSource': 'https://portal.kidsfirstdrc.org/file/'},
     '@type': 'DatasetDistribution',
     'formats': ['pdf'],
     'size': 5779507,
     'unit': {'@type': 'Annotation', 'value': 'bytes'},
     'access': {'@type': 'Access',
      'identifier': {'@type': 'Identifier',
       'identifier': '9f586dc1-1df8-4f59-9a01-803141fffb94.arriba.fusions.pdf'},
      'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
        'identifier': 'harmonized/gene-fusions/9f586dc1-1df8-4f59-9a01-803141fffb94.arriba.fusions.pdf'}],
      'landingPage': 'https://portal.kidsfirstdrc.org/file/GF_001JWT9N'}}],
   'types': [{'@type': 'DataType',
     'information': {'@type': 'Annotation', 'value': 'Gene Fusions'}}],
   'extraProperties': [{'@type': 'CategoryValuesPair',
     'category': 'tissue',
     'values': [{'@type': 'Annotation', 'value': 'Tumor'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'diagnosis',
     'values': [{'@type': 'Annotation',
       'value': 'Brainstem glioma- Diffuse intrinsic pontine glioma'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'proband',
     'values': [{'@type': 'Annotation', 'value': 'Yes'}]}]},
  {'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': 'c4c9d542-21fb-487d-ac07-916d466774a8',
    'identifierSource': 'https://portal.kidsfirstdrc.org/'},
   'storedIn': {'@type': 'DataRepository', 'name': 'gen3'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier',
     'identifier': 'SD_46SK55A3',
     'identifierSource': 'https://portal.kidsfirstdrc.org/'},
    'name': 'Kids First: Congenital Diaphragmatic Hernia'},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier',
      'identifier': 'BS_BKH9S8YN',
      'identifierSource': 'https://portal.kidsfirstdrc.org/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'CDH14-0006'},
      {'@type': 'AlternateIdentifier', 'identifier': 'CDH14-0006'}]},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier',
      'identifier': 'PT_2HN13G42',
      'identifierSource': 'https://portal.kidsfirstdrc.org/participant/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'CDH14-0006'}]}],
   'distributions': [{'identifier': {'@type': 'Identifier',
      'identifier': 'GF_002DRSGP',
      'identifierSource': 'https://portal.kidsfirstdrc.org/file/'},
     '@type': 'DatasetDistribution',
     'formats': ['cram'],
     'size': 23537329440,
     'unit': {'@type': 'Annotation', 'value': 'bytes'},
     'access': {'@type': 'Access',
      'identifier': {'@type': 'Identifier', 'identifier': 'CDH14-0006.cram'},
      'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
        'identifier': 's3://kf-study-us-east-1-prd-sd-46sk55a3/source/GMKF_Gabriel_Chung_CDH_WGS/RP-1370/WGS/CDH14-0006/v2/CDH14-0006.cram'}],
      'landingPage': 'https://portal.kidsfirstdrc.org/file/GF_002DRSGP'}}],
   'types': [{'@type': 'DataType',
     'information': {'@type': 'Annotation', 'value': 'Aligned Reads'}}],
   'extraProperties': [{'@type': 'CategoryValuesPair',
     'category': 'tissue',
     'values': [{'@type': 'Annotation', 'value': 'Normal'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'diagnosis',
     'values': [{'@type': 'Annotation',
       'value': 'congential diaphragmatic hernia'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'proband',
     'values': [{'@type': 'Annotation', 'value': 'Yes'}]}]},
  {'@type': 'Dataset',
   'identifier': {'@type': 'Identifier',
    'identifier': '02642bc8-ae46-45a1-8932-2765cf1df480',
    'identifierSource': 'https://portal.kidsfirstdrc.org/'},
   'storedIn': {'@type': 'DataRepository', 'name': 'gen3'},
   'producedBy': {'@type': 'Study',
    'identifier': {'@type': 'Identifier',
     'identifier': 'SD_46SK55A3',
     'identifierSource': 'https://portal.kidsfirstdrc.org/'},
    'name': 'Kids First: Congenital Diaphragmatic Hernia'},
   'isAbout': [{'@type': 'BiologicalEntity',
     'identifier': {'@type': 'Identifier',
      'identifier': 'BS_MP5P3ZPH',
      'identifierSource': 'https://portal.kidsfirstdrc.org/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'CDH4-84F'},
      {'@type': 'AlternateIdentifier', 'identifier': 'CDH4-84F'}]},
    {'@type': 'StudyGroup',
     'identifier': {'@type': 'Identifier',
      'identifier': 'PT_MH56TZJD',
      'identifierSource': 'https://portal.kidsfirstdrc.org/participant/'},
     'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
       'identifier': 'CDH4-84F'}]}],
   'distributions': [{'identifier': {'@type': 'Identifier',
      'identifier': 'GF_004J173A',
      'identifierSource': 'https://portal.kidsfirstdrc.org/file/'},
     '@type': 'DatasetDistribution',
     'formats': ['cram'],
     'size': 17290282717,
     'unit': {'@type': 'Annotation', 'value': 'bytes'},
     'access': {'@type': 'Access',
      'identifier': {'@type': 'Identifier', 'identifier': 'CDH4-84F.cram'},
      'alternateIdentifiers': [{'@type': 'AlternateIdentifier',
        'identifier': 's3://kf-study-us-east-1-prd-sd-46sk55a3/source/GMKF_Gabriel_Chung_CDH_WGS/RP-1370/WGS/CDH4-84F/v2/CDH4-84F.cram'}],
      'landingPage': 'https://portal.kidsfirstdrc.org/file/GF_004J173A'}}],
   'types': [{'@type': 'DataType',
     'information': {'@type': 'Annotation', 'value': 'Aligned Reads'}}],
   'extraProperties': [{'@type': 'CategoryValuesPair',
     'category': 'tissue',
     'values': [{'@type': 'Annotation', 'value': 'Normal'}]},
    {'@type': 'CategoryValuesPair',
     'category': 'proband',
     'values': [{'@type': 'Annotation', 'value': 'No'}]}]}]}
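The three `extraProperties` entries built in `dats_from_record` share one shape and one "skip if '--'" rule. As an optional refactoring sketch, that repetition can be folded into a helper. The names `extra_property` and `extra_properties_from_record` are hypothetical; the categories and column names are the ones used above.

```python
def extra_property(category, value, placeholder='--'):
  """Return a CategoryValuesPair, or None when the cell only holds the placeholder."""
  if value == placeholder:
    return None
  return {
    '@type': 'CategoryValuesPair',
    'category': category,
    'values': [{'@type': 'Annotation', 'value': value}],
  }

def extra_properties_from_record(record):
  """Build the extraProperties list, skipping columns that hold no real value."""
  columns = {
    'tissue': 'Tissue Type (Source Text)',
    'diagnosis': 'Diagnosis (Source Text)',
    'proband': 'Proband',
  }
  pairs = (
    extra_property(category, record[column])
    for category, column in columns.items()
  )
  return [pair for pair in pairs if pair is not None]

# A plain dict mirroring the last row shown above
record = {
  'Tissue Type (Source Text)': 'Normal',
  'Diagnosis (Source Text)': '--',
  'Proband': 'No',
}
props = extra_properties_from_record(record)
```

For this record the diagnosis entry is dropped, which matches the last Dataset in the output above.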

With some mapping effort, we were able to get all of the metadata from the file manifest table into DATS. Note that many fields are still missing, including license and authorship information; these need to be found elsewhere to further complete and improve this model. With our newly created object, let’s check that we did not make any mistakes! Just as we can use JSON Schema for autocompletion in our editor, we can also use it for programmatic validation of our DATS object.

from jsonschema import Draft4Validator

# Get the first record
record = dats['@graph'][0]

# Validate it against the DATS dataset schema (the remote $ref is fetched over the network)
validator = Draft4Validator({'$ref': 'https://raw.githubusercontent.com/datatagsuite/schema/master/dataset_schema.json'})
for error in validator.iter_errors(record):
  display(error.message)

"{'@type': 'BiologicalEntity', 'identifier': {'@type': 'Identifier', 'identifier': 'BS_A7Q8G0Y1', 'identifierSource': 'https://portal.kidsfirstdrc.org/'}, 'alternateIdentifiers': [{'@type': 'AlternateIdentifier', 'identifier': '7316-126-T-112502.RNA-Seq'}, {'@type': 'AlternateIdentifier', 'identifier': '746063'}]} is not valid under any of the given schemas"

"{'@type': 'StudyGroup', 'identifier': {'@type': 'Identifier', 'identifier': 'PT_PR4YBBH3', 'identifierSource': 'https://portal.kidsfirstdrc.org/participant/'}, 'alternateIdentifiers': [{'@type': 'AlternateIdentifier', 'identifier': 'C29274'}]} is not valid under any of the given schemas"

"'title' is a required property"

"'creators' is a required property"

Uh-oh; we’ve got some errors.

Let’s fix them and try again.

For readability, here is a diff of the changes we had to make; the full corrected function follows it.

@@ -1,6 +1,7 @@
 def dats_from_record(record):
   return {
     '@type': 'Dataset',
+    'title': record['Study Name'],
     'identifier': {
       '@type': 'Identifier',
       'identifier': record['Latest DID'],
@@ -10,6 +11,12 @@
       '@type': 'DataRepository',
       'name': record['Repository'],
     },
+    'creators': [
+      {
+        "@type": "Organization",
+        "name": "KidsFirst",
+      }
+    ],
     'producedBy': {
       '@type': 'Study',
       'identifier': {
@@ -22,6 +29,8 @@
     'isAbout': [
       {
         '@type': 'BiologicalEntity',
+        # NOTE: name is a required field
+        'name': record['Biospecimen ID'],
         'identifier': {
           '@type': 'Identifier',
           'identifier': record['Biospecimen ID'],
@@ -42,6 +51,8 @@
       },
       {
         '@type': 'StudyGroup',
+        # NOTE: name is a required field
+        'name': record['Participants ID'],
         'identifier': {
           # NOTE: Ideally, `${identifierSource}${identifier}` resolves to a landing page for this entity
           '@type': 'Identifier',
@@ -69,7 +80,7 @@
         'formats': [
           record['File Format'],
         ],
-        'size': record['File Size'],
+        'size': int(record['File Size']),
         'unit': {
           '@type': 'Annotation',
           'value': 'bytes',
@@ -141,4 +152,4 @@
         ],
       } if record['Proband'] != '--' else None, # Don't create entries for junk,
     ])],
-  }
+  }

You can see that one value had an invalid type (File Size, which is now passed through int()) and that several required fields were missing. In some cases we need to add metadata that wasn’t in the original table, such as metadata about our own organization.

That information matters in a catalog of many datasets but is often absent from your own data. It is better for you to decide how your data links back to your organization than for us to try to figure it out, which is why DATS requires that metadata.

def dats_from_record(record):
  return {
    '@type': 'Dataset',
    'title': record['Study Name'],
    'identifier': {
      '@type': 'Identifier',
      'identifier': record['Latest DID'],
      'identifierSource': 'https://portal.kidsfirstdrc.org/',
    },
    'storedIn': {
      '@type': 'DataRepository',
      'name': record['Repository'],
    },
    'creators': [
      {
        "@type": "Organization",
        "name": "KidsFirst",
      }
    ],
    'producedBy': {
      '@type': 'Study',
      'identifier': {
        '@type': 'Identifier',
        'identifier': record['Study ID'],
        'identifierSource': 'https://portal.kidsfirstdrc.org/',
      },
      'name': record['Study Name'],
    },
    'isAbout': [
      {
        '@type': 'BiologicalEntity',
        # NOTE: name is a required field
        'name': record['Biospecimen ID'],
        'identifier': {
          '@type': 'Identifier',
          'identifier': record['Biospecimen ID'],
          'identifierSource': 'https://portal.kidsfirstdrc.org/',
        },
        'alternateIdentifiers': [
          {
            '@type': 'AlternateIdentifier',
            'identifier': record['Sample External ID'],
            # NOTE: Preferred identifierSource with globally unique semantic URI
          },
          {
            '@type': 'AlternateIdentifier',
            'identifier': record['Aliquot External ID'],
            # NOTE: Preferred identifierSource with globally unique semantic URI
          },
        ],
      },
      {
        '@type': 'StudyGroup',
        # NOTE: name is a required field
        'name': record['Participants ID'],
        'identifier': {
          # NOTE: Ideally, `${identifierSource}${identifier}` resolves to a landing page for this entity
          '@type': 'Identifier',
          'identifier': record['Participants ID'],
          'identifierSource': 'https://portal.kidsfirstdrc.org/participant/',
        },
        'alternateIdentifiers': [
          {
            '@type': 'AlternateIdentifier',
            'identifier': record['Participant External ID'],
            # NOTE: Preferred identifierSource with globally unique semantic URI
          },
        ],
      },
    ],
    'distributions': [
      {
        'identifier': {
          # NOTE: Ideally, `${identifierSource}${identifier}` resolves to a landing page for this entity
          '@type': 'Identifier',
          'identifier': record['File ID'],
          'identifierSource': 'https://portal.kidsfirstdrc.org/file/',
        },
        '@type': 'DatasetDistribution',
        'formats': [
          record['File Format'],
        ],
        'size': int(record['File Size']),
        'unit': {
          '@type': 'Annotation',
          'value': 'bytes',
          # NOTE: Preferred valueIRI with globally unique semantic URI
        },
        'access': {
          '@type': 'Access',
          'identifier': {
            '@type': 'Identifier',
            'identifier': record['File Name'],
          },
          'alternateIdentifiers': [
            {
              '@type': 'AlternateIdentifier',
              'identifier': record['File External ID'],
              # NOTE: Preferred identifierSource with globally unique semantic URI
            },
          ],
          'landingPage': 'https://portal.kidsfirstdrc.org/file/' + record['File ID'],
          # NOTE: Ideally accessURL would be specified
        }
      }
    ],
    'types': [
      {
        '@type': 'DataType',
        'information': {
          '@type': 'Annotation',
          'value': record['Data Type'],
        },
      },
    ],
    'extraProperties': [*filter(None, [
      # Metadata that doesn't fit anywhere else in DATS but may be relevant
      {
        '@type': 'CategoryValuesPair',
        'category': 'tissue',
        # NOTE: Preferred categoryIRI with globally unique semantic URI
        'values': [
          {
            '@type': 'Annotation',
            'value': record['Tissue Type (Source Text)'],
            # NOTE: Preferred valueIRI with globally unique semantic URI
          }
        ]
      } if record['Tissue Type (Source Text)'] != '--' else None,
      {
        '@type': 'CategoryValuesPair',
        'category': 'diagnosis',
        # NOTE: Preferred categoryIRI with globally unique semantic URI
        'values': [
          {
            '@type': 'Annotation',
            'value': record['Diagnosis (Source Text)'],
            # NOTE: Preferred valueIRI with globally unique semantic URI
          }
        ]
      } if record['Diagnosis (Source Text)'] != '--' else None, # Don't create entries for junk
      {
        '@type': 'CategoryValuesPair',
        'category': 'proband',
        # NOTE: Preferred categoryIRI with globally unique semantic URI
        'values': [
          {
            '@type': 'Annotation',
            'value': record['Proband'],
            # NOTE: Preferred valueIRI with globally unique semantic URI
          },
        ],
      } if record['Proband'] != '--' else None, # Don't create entries for junk,
    ])],
  }
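
The int(record['File Size']) cast above assumes every row carries a numeric size. The export marks missing values with '--' in other columns, so a more defensive conversion may be worth having. The following is a minimal sketch under that assumption; parse_size is a hypothetical helper, not part of the original recipe, and it is not known whether the File Size column ever contains a placeholder.

```python
def parse_size(value, missing='--'):
  """Return the size as a plain int, or None for empty or placeholder cells."""
  text = str(value).strip()
  # Treat blanks, the table's placeholder and pandas' NaN/None renderings as missing
  if text in ('', missing, 'nan', 'None'):
    return None
  try:
    return int(text)
  except ValueError:
    # Fall back for values rendered as floats, e.g. '1024.0'
    return int(float(text))

print(parse_size('1024'))
print(parse_size('--'))
```

When the helper returns None, it is probably better to leave the size key out of the distribution than to emit a null, since the schema expects a number there.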

# Converting each element to DATS
dats = {
  # schema.org context, gives RDF meaning to `@type` and `predicates` as defined by schema.org
  '@context': 'http://w3id.org/dats/context/sdo/dataset_sdo_context.jsonld',
  '@graph': [
    dats_from_record(record)
    for _, record in df.head().iterrows()
  ]
}

# Let's validate *all converted records* (df.head() keeps the first five rows)
# against the DATS dataset schema
validator = Draft4Validator({
  '$ref': 'https://raw.githubusercontent.com/datatagsuite/schema/master/dataset_schema.json'
})

for record in dats['@graph']:
  for error in validator.iter_errors(record):
    display({ 'title': record['title'], 'error': error.message })

Conclusion¶

As hoped, the validator reports no errors, so we have successfully produced DATS. Knowing that our DATS is “valid” does not mean we are done, though. Validity is only the baseline: the more fields we fill out in the DATS, the more useful the metadata becomes. This is where a FAIR assessment comes in. We can write metrics that also speak DATS but check for the presence of certain fields, or verify that an identifier can actually be resolved against its identifierSource.
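
As a rough illustration of a presence-based metric, the sketch below scores a DATS record by the fraction of a chosen set of optional properties that are filled in. The field_coverage function and the RECOMMENDED list are hypothetical; which properties a real FAIR assessment would look for is an assumption here, not something this recipe defines.

```python
# Hypothetical selection of optional DATS dataset properties worth filling in
RECOMMENDED = ['description', 'licenses', 'keywords', 'dates', 'distributions']

def field_coverage(record, fields):
  """Return (fraction of fields present and non-empty, sorted list of missing fields)."""
  present = [f for f in fields if record.get(f) not in (None, '', [], {})]
  missing = sorted(set(fields) - set(present))
  return len(present) / len(fields), missing

# Toy record: only 'distributions' counts, because 'keywords' is empty
toy = {'title': 'Example', 'distributions': [{'@type': 'DatasetDistribution'}], 'keywords': []}
score, missing = field_coverage(toy, RECOMMENDED)
print(score, missing)
```

A metric like this complements schema validation: the schema says whether a record is acceptable, while the coverage score gives a rough sense of how much more could be described.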

Nonetheless, we have taken a step in the right direction. Future recipes will discuss performing FAIR assessments on this DATS, converting it to CFDE’s C2M2 Frictionless Metadata model and more!
