In search for ground-truth. Quantifying uncertainty in expert labelling for machine learning

Sam Mitchinson¹, Jessica Johnson¹, Ben Milner², Jason Lines²

Affiliations: ¹School of Environmental Sciences, University of East Anglia, Norwich, United Kingdom; ²School of Computing Sciences, University of East Anglia, Norwich, United Kingdom

Presentation type: Talk

Presentation time: Thursday 08:30 - 08:45, Room R380

Programme No: 3.1.11

Theme 3 > Session 1

Abstract

High-quality ground-truth labels are essential for training reliable and accurate deep learning models. While advances in the automatic classification of volcano-seismic signatures hold promise for global implementation in volcano observatories, training data often remain sparse and are typically labelled by a single expert, with uncertainty frequently overlooked or unquantified. This study investigates the level of agreement among experts when classifying volcano-seismic signatures, highlighting potential ambiguities in the training data that may impact model performance. The specific objectives are: (1) to evaluate agreement among experts on volcano-seismic signatures, and (2) to develop an agreement-based classification method for volcanic earthquakes. The study involves an online questionnaire, distributed globally to volcano experts, that asks participants to classify volcano-seismic signatures into predefined categories: volcano-tectonic, long-period, hybrid, and other. Participants will provide a likelihood score on a continuous scale from -1 (certain the signal does not belong to the category) to 1 (certain the signal does belong to the category). Annotator agreement and uncertainty will be assessed using Kendall's W and Fleiss' kappa, both robust statistical measures of inter-rater reliability. Ground-truth labels will be assigned probabilities based on the maximum likelihood derived from all discrete classes. Incorporating uncertainty into the data provides transparency in the machine learning process and may offer insights into whether the current standard classification groupings are truly adequate for precise volcano monitoring practices.
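As an illustration of the two agreement measures named above, the following is a minimal sketch using their textbook definitions, not the authors' implementation. It assumes that Fleiss' kappa is computed on discrete votes (for example, each expert's top-scoring category, which is an assumption made here) and that Kendall's W is computed per category on ranks of the continuous scores, without a tie correction. The function names and the toy numbers are hypothetical and are not survey data.

```python
# Sketch only: textbook Fleiss' kappa and Kendall's W on hypothetical toy data.
CATEGORIES = ["volcano-tectonic", "long-period", "hybrid", "other"]


def fleiss_kappa(counts):
    """counts[i][j] = number of experts who assigned signal i to category j.

    Every row must sum to the same number of experts (at least 2).
    Undefined (division by zero) if all votes fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each signal needs the same number (>= 2) of votes")
    # Observed agreement: share of agreeing expert pairs, averaged over signals.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


def average_ranks(values):
    """Rank values from 1 (lowest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(scores):
    """scores[r][i] = score expert r gave signal i for one category.

    Returns W in [0, 1]; no tie correction is applied (an assumption).
    """
    m, n = len(scores), len(scores[0])
    if n < 2:
        raise ValueError("need at least two signals to rank")
    rank_rows = [average_ranks(row) for row in scores]
    rank_sums = [sum(rank_rows[r][i] for r in range(m)) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


if __name__ == "__main__":
    # Hypothetical votes: 4 signals, 5 experts, columns follow CATEGORIES.
    toy_counts = [[5, 0, 0, 0], [1, 3, 1, 0], [0, 2, 3, 0], [0, 0, 1, 4]]
    # Hypothetical scores in [-1, 1]: 3 experts, 4 signals, one category.
    toy_scores = [
        [0.9, -0.2, -0.6, -1.0],
        [0.8, 0.1, -0.5, -0.9],
        [1.0, -0.4, -0.1, -0.8],
    ]
    print("Fleiss' kappa:", round(fleiss_kappa(toy_counts), 3))
    print("Kendall's W:", round(kendalls_w(toy_scores), 3))
```

Both statistics equal 1 under perfect agreement. Fleiss' kappa is near 0 when agreement is at chance level, and Kendall's W is near 0 when the experts' rankings are unrelated. Reducing continuous scores to a single vote discards the graded uncertainty, which is one reason a rank-based measure such as Kendall's W is a useful complement.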