sonar

The SoNaR corpus is a large Dutch text corpus developed for linguistic research. It consists of two main parts: SoNaR-500, with over 500 million words from a wide variety of domains and genres, and SoNaR-1, a manually verified 1-million-word subset with extensive semantic annotations. The corpus includes automatic and manual annotations such as tokenization, part-of-speech tagging, lemmatization, named entity recognition, coreference annotation, and annotation of spatial and temporal relations. SoNaR supports research in computational linguistics, language modeling, and natural language processing, and is maintained by the Dutch Language Institute (INT).

Tags: ssh dutch linguistic corpus

Associated Projects (1)

becos eval ineo

project

The BeCoS Corpus is a Dutch speech dataset containing recordings and annotated transcriptions of spo...

Tags: ssh dutch speech corpus

Associated Rubrics (1)

FAIR metrics by fairmetrics.org

any

A set of metrics to assess the FAIRness of digital resources. Based on work published in https://www...

Tags: fair metrics universal core metrics all digital objects