FAIRshake

SoNaR

The SoNaR corpus is a large Dutch text corpus developed for linguistic research. It consists of two main parts: SoNaR-500, with over 500 million words from a wide variety of domains and genres, and SoNaR-1, a manually verified 1-million-word subset with extensive semantic annotations. The corpus includes automatic and manual annotations such as tokenization, part-of-speech tagging, lemmatization, named entity recognition, coreference annotation, and annotation of spatial and temporal relations. SoNaR supports research in computational linguistics, language modeling, and natural language processing, and is maintained by the Dutch Language Institute (INT).

Tags: ssh dutch linguistic corpus

URL(s):

https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/


Digital Object Assessments (1)

Rubric: FAIR metrics by fairmetrics.org
Project: becos eval ineo
Date: Jun 23, 2025

Metrics:

BeCos ineo: yes (1.00)
Globally unique identifier: yes (1.00)
Persistent identifier: no (0.00)
Machine-readable metadata: yes (1.00)
Standardized metadata: no (0.00)
Resource identifier in metadata: yes (1.00)
Resource discovery through web search: yes (1.00)
Open, Free, Standardized Access protocol: yes (1.00)
Protocol to access restricted content: no (0.00)
Persistence of resource and metadata: no (0.00)
Resource uses formal language: no (0.00)
FAIR vocabulary: no (0.00)
Linked: no (0.00)
Digital resource license: yes (1.00)
Metadata license: no (0.00)
Provenance scheme: no (0.00)
Certificate of compliance to community standard: no (0.00)
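
The assessment above can be condensed into a single figure. The following is a minimal sketch in Python, assuming that each score lines up with the metric name in the same column position and that an unweighted mean is an acceptable summary. FAIRshake may aggregate scores differently, and the `summarize` helper is hypothetical, not part of any FAIRshake API.

```python
# Scores copied from the assessment above, in column order.
# Assumption: each value belongs to the header in the same position.
scores = {
    "BeCos ineo": 1.0,
    "Globally unique identifier": 1.0,
    "Persistent identifier": 0.0,
    "Machine-readable metadata": 1.0,
    "Standardized metadata": 0.0,
    "Resource identifier in metadata": 1.0,
    "Resource discovery through web search": 1.0,
    "Open, Free, Standardized Access protocol": 1.0,
    "Protocol to access restricted content": 0.0,
    "Persistence of resource and metadata": 0.0,
    "Resource uses formal language": 0.0,
    "FAIR vocabulary": 0.0,
    "Linked": 0.0,
    "Digital resource license": 1.0,
    "Metadata license": 0.0,
    "Provenance scheme": 0.0,
    "Certificate of compliance to community standard": 0.0,
}


def summarize(scores):
    """Return (metrics met, total metrics, unweighted mean score)."""
    met = sum(1 for value in scores.values() if value == 1.0)
    total = len(scores)
    mean = sum(scores.values()) / total
    return met, total, mean


met, total, mean = summarize(scores)
print(f"{met} of {total} metrics met, mean score {mean:.2f}")
# prints "7 of 17 metrics met, mean score 0.41"
```

Listing the unmet metrics (for example `[name for name, value in scores.items() if value == 0.0]`) gives a quick checklist of where the record's FAIRness could be improved.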