NIH CFDE Selected Terminologies

Authors Philippe Rocca-Serra

Maintainers Philippe Rocca-Serra

Version: 0.1

License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication


Objectives

The main objective of this section is to draw attention to the importance of semantics for interoperability and reusability, as implemented in the C2M2 model, and to the associated task of ingesting datasets from an array of resources, each tackling a specific research area.

A secondary objective is to prepare for the extract/transform/load (ETL) processes by making readers aware of this model's requirements.

Overview

For a number of attributes, the C2M2 model delegates the definition of the associated value sets to controlled terminologies and ontologies. This keeps the specification stable while retaining flexibility: maintenance of the value sets, typically the addition of new values, is outsourced to the terminology providers. The C2M2 model specification clearly identifies the elements that require values selected from a controlled terminology, and the following table offers a full overview. The table also highlights the planned implementation, with phased releases from compliance level 0 through compliance level 2. Each release brings in new types, thereby extending the interoperability aspect of the FAIR potential of the datasets made available by the DCC through the Deriva-based system, following the extract/transform/load process from the source repositories.

C2M2 vetted Vocabularies

| Domain | Resource Name | License | C2M2 Level 0 | C2M2 Level 1 | C2M2 Level 2 |
|---|---|---|---|---|---|
| id_namespace_string | CFDE internal CV | NA | :heavy_plus_sign: | :heavy_plus_sign: | :heavy_plus_sign: |
| subject_role | CFDE internal CV | NA | | :heavy_plus_sign: | :heavy_plus_sign: |
| subject_granularity | CFDE internal CV | NA | | :heavy_plus_sign: | :heavy_plus_sign: |
| protocol | CFDE internal CV | NA | | :heavy_plus_sign: | :heavy_plus_sign: |
| taxonomy | NCBITax | CC0 1.0 (public domain) | | :heavy_plus_sign: | :heavy_plus_sign: |
| anatomy | UBERON | CC-BY | | :heavy_plus_sign: | :heavy_plus_sign: |
| sample_type | OBI | CC-BY | | :heavy_plus_sign: | :heavy_plus_sign: |
| assay_type | OBI | CC-BY | | :heavy_plus_sign: | :heavy_plus_sign: |
| file_format | EDAM | CC BY-SA 4.0 | | :heavy_plus_sign: | :heavy_plus_sign: |
| data_type | EDAM | CC BY-SA 4.0 | | :heavy_plus_sign: | :heavy_plus_sign: |
| disease | MONDO | CC-BY | | | :heavy_plus_sign: |
| disease | DOID | CC-BY | | | :heavy_plus_sign: |
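
To make the ETL implications of the vetted vocabularies concrete, the following is a minimal sketch of how a loader might check that an annotation comes from the expected resource. It is not part of the C2M2 specification: the `Vocabulary` type, the `check_term` function, the CURIE prefixes, and the `min_level` values (the compliance level at which each domain is assumed to first appear) are all illustrative assumptions. The CFDE internal controlled vocabularies are left out because they are not identified by an external prefix.

```python
from typing import NamedTuple


class Vocabulary(NamedTuple):
    resource: str         # resource name as listed in the table
    prefixes: frozenset   # ASSUMED CURIE prefixes, for illustration only
    min_level: int        # ASSUMED first compliance level using the domain


# Hypothetical encoding of the externally maintained vocabularies.
VETTED_VOCABULARIES = {
    "taxonomy": Vocabulary("NCBITax", frozenset({"NCBITaxon"}), 1),
    "anatomy": Vocabulary("UBERON", frozenset({"UBERON"}), 1),
    "sample_type": Vocabulary("OBI", frozenset({"OBI"}), 1),
    "assay_type": Vocabulary("OBI", frozenset({"OBI"}), 1),
    "file_format": Vocabulary("EDAM", frozenset({"format"}), 1),
    "data_type": Vocabulary("EDAM", frozenset({"data"}), 1),
    "disease": Vocabulary("MONDO or DOID", frozenset({"MONDO", "DOID"}), 2),
}


def check_term(domain: str, curie: str, level: int) -> bool:
    """Return True if `curie` looks acceptable for `domain` at `level`.

    Only the prefix is checked; whether the term actually exists in the
    resource would need a lookup against the terminology itself.
    """
    vocab = VETTED_VOCABULARIES[domain]
    if level < vocab.min_level:
        # The domain is assumed not to be part of the model at this level.
        return False
    prefix, sep, local_id = curie.partition(":")
    return bool(sep) and bool(local_id) and prefix in vocab.prefixes
```

A free-text value such as `"liver"` fails this check for the anatomy domain, which is the signal that a curation step is needed before loading.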

Which terminologies do DCCs currently use?

For each of the potential data sources and for a set of core search facets, a survey of the semantic resources used by representative DCCs is summarized in the table below.

:warning: Note that the table includes a subsection (indicated in italics) that covers identification schemes used for molecular entities. These are distinct from concept annotation with ontology terms; however, since they allow interoperability between resources, they have been included.

| Domain | MW | LINCS | HMP | GTEx | 4D Nucleome | KidsFirst |
|---|---|---|---|---|---|---|
| taxonomy | free text | free text | free text | free text | free text | free text |
| anatomy | free text | free text | free text | UBERON | free text | NCIT |
| sample type | free text | free text | free text | UBERON | free text | NCIT |
| disease | free text | free text | free text | free text | free text | HPO NCIT MONDO |
| assay type | internal cv/free text | BAO | internal cv/free text | internal cv/free text | internal cv/free text | internal cv/free text |
| data type | _ | free text | free text | _ | internal cv/free text | internal cv/free text |
| *chemical compound* | PubChem CID, InChI | PubChem CID, InChI | _ | _ | _ | _ |
| *gene product* | RefSeq | _ | _ | _ | _ | _ |
| *protein* | UniProt | _ | _ | _ | _ | _ |
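
The survey above can also be read as a curation backlog. The sketch below, under the assumption that any cell mentioning "free text" needs semantic markup, lists for each search facet the DCCs whose values would have to be anchored to a controlled term. The names `SURVEY`, `needs_curation`, and `curation_backlog` are hypothetical, and the data simply transcribes the ontology-annotation rows of the table.

```python
DCCS = ["MW", "LINCS", "HMP", "GTEx", "4D Nucleome", "KidsFirst"]

# One list per facet, in the same order as DCCS (transcribed from the table).
SURVEY = {
    "taxonomy": ["free text"] * 6,
    "anatomy": ["free text", "free text", "free text",
                "UBERON", "free text", "NCIT"],
    "sample type": ["free text", "free text", "free text",
                    "UBERON", "free text", "NCIT"],
    "disease": ["free text"] * 5 + ["HPO NCIT MONDO"],
    "assay type": ["internal cv/free text", "BAO"]
                  + ["internal cv/free text"] * 4,
    "data type": ["_", "free text", "free text", "_",
                  "internal cv/free text", "internal cv/free text"],
}


def needs_curation(cell: str) -> bool:
    """ASSUMPTION: a cell mentioning 'free text' requires semantic markup."""
    return "free text" in cell


def curation_backlog(survey: dict) -> dict:
    """Map each facet to the DCCs whose values still rely on free text."""
    return {
        facet: [dcc for dcc, cell in zip(DCCS, cells) if needs_curation(cell)]
        for facet, cells in survey.items()
    }


if __name__ == "__main__":
    for facet, dccs in curation_backlog(SURVEY).items():
        print(f"{facet}: {len(dccs)} of {len(DCCS)} DCCs ({', '.join(dccs)})")
```

Cells marked `_` are treated here as "no information" rather than as free text, which is a design choice of this sketch and not a statement about the underlying resources.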


Conclusions

  • By explicitly identifying a number of semantic artefacts for describing key attributes, the C2M2 defines a curation framework, with the aim of anchoring free text descriptors to controlled terms, which can be exploited for query expansion or resource linking.

  • The resource survey that has been carried out is an important step in the FAIRification process, as it identifies potential areas of intervention, defined as areas where semantic markup of free-text descriptions can deliver gains in interoperability and reusability.

  • Taking taxonomic descriptors as an example, harmonization across the various sources can be easily achieved by relying on a resource such as NCBI Taxonomy, and the curation effort is reduced by the limited diversity of species found in the different databases.

  • On the other hand, harmonization for domains such as sample type and assay type can be more involved, not to mention phenotypic descriptions or disease, even though level 0 and level 1 compliance do not expect such a degree of integration.
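
The taxonomy case mentioned above lends itself to a simple illustration of anchoring free text to controlled terms. The following is a minimal sketch, assuming a hand-curated synonym table: the `SPECIES_TO_TAXON` mapping and the `anchor_species` function are hypothetical and would in practice be replaced by a lookup against NCBI Taxonomy itself. The two identifiers used (9606 for Homo sapiens, 10090 for Mus musculus) are real NCBI Taxonomy identifiers; the `NCBITaxon:` prefix is an assumed CURIE convention.

```python
from typing import Optional

# Hypothetical, hand-curated synonym table (lower-case keys).
SPECIES_TO_TAXON = {
    "homo sapiens": "NCBITaxon:9606",
    "human": "NCBITaxon:9606",
    "mus musculus": "NCBITaxon:10090",
    "mouse": "NCBITaxon:10090",
}


def anchor_species(free_text: str) -> Optional[str]:
    """Return a taxon CURIE for a free-text species label, or None.

    The label is lower-cased and runs of whitespace are collapsed before
    the lookup. None means the label needs manual curation.
    """
    key = " ".join(free_text.lower().split())
    return SPECIES_TO_TAXON.get(key)


if __name__ == "__main__":
    for label in ["Homo sapiens", "  MOUSE ", "zebrafish"]:
        print(repr(label), "->", anchor_species(label))
```

Because only a limited number of species occur across the surveyed resources, a small table of this kind may cover most records. Sample type, assay type, and disease would need a richer matching strategy.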