
In search for ground-truth. Quantifying uncertainty in expert labelling for machine learning

Sam Mitchinson1, Jessica Johnson1, Ben Milner2, Jason Lines2

  • Affiliations: 1School of Environmental Sciences, University of East Anglia, Norwich, United Kingdom; 2School of Computing Sciences, University of East Anglia, Norwich, United Kingdom

  • Presentation type: Talk

  • Presentation time: Thursday 08:30 - 08:45, Room R380

  • Programme No: 3.1.11

  • Theme 3 > Session 1


Abstract

High-quality ground-truth labels are essential for training reliable and accurate deep learning models. While advances in the automatic classification of volcano-seismic signatures hold promise for global implementation in volcano observatories, training data often remain sparse and are typically labelled by a single expert, with label uncertainty frequently overlooked or unquantified. This study investigates the level of agreement among experts when classifying volcano-seismic signatures, highlighting potential ambiguities in the training data that may impact model performance. The specific objectives of this study are: (1) to evaluate agreement among experts on volcano-seismic signatures, and (2) to develop an agreement-based classification method for volcanic earthquakes. The study involves designing an online questionnaire, distributed globally to volcano experts, that asks participants to classify volcano-seismic signatures into predefined categories: volcano-tectonic, long-period, hybrid, and other. Participants will provide a likelihood score on a continuous scale from -1 (certain the signal does not belong to the category) to 1 (certain the signal does belong to the category). Annotator agreement and uncertainty will be assessed using Kendall's W and Fleiss' Kappa, both robust statistical measures of inter-rater reliability. Ground-truth labels will be assigned probabilities based on the maximum likelihood derived from all discrete classes. Incorporating uncertainty into the data provides transparency in the machine learning process and may offer insights into whether the current standard classification groupings are adequate for precise volcano monitoring practices.
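As an illustration of the two agreement measures named above, the following is a minimal sketch using the textbook formulas for Kendall's W and Fleiss' Kappa. It is not the authors' implementation. The function names, the array layouts, and the `soft_label` step are hypothetical: the abstract does not say how continuous scores are turned into class probabilities, so rescaling the mean score from [-1, 1] to [0, 1] and normalising is only one possible reading. The Kendall's W function also assumes that no rater gives two signals the same score (no tie correction).

```python
import numpy as np

# Category names taken from the abstract.
CLASSES = ["volcano-tectonic", "long-period", "hybrid", "other"]


def kendalls_w(scores):
    """Kendall's W for a (raters x items) array of scores.

    Assumes no tied scores within a rater (no tie correction applied).
    """
    scores = np.asarray(scores, dtype=float)
    m, n = scores.shape  # m raters, n items
    # Rank each rater's scores across items (1 = lowest score).
    # The double argsort yields valid ranks only when there are no ties.
    ranks = scores.argsort(axis=1).argsort(axis=1) + 1
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))


def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) array of vote counts.

    Every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    # Overall proportion of votes falling in each category.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    # Observed agreement for each item.
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()
    p_exp = (p_cat ** 2).sum()  # agreement expected by chance
    return (p_bar - p_exp) / (1.0 - p_exp)


def soft_label(scores):
    """Hypothetical soft label for one signal.

    scores: (raters x classes) likelihood scores in [-1, 1].
    Assumption: mean scores are rescaled to [0, 1] and normalised to sum to 1.
    """
    scores = np.asarray(scores, dtype=float)
    mean01 = (scores.mean(axis=0) + 1.0) / 2.0
    total = mean01.sum()
    if total == 0:
        # Every rater rejected every class: fall back to a uniform distribution.
        probs = np.full(len(mean01), 1.0 / len(mean01))
    else:
        probs = mean01 / total
    return probs, CLASSES[int(probs.argmax())]


if __name__ == "__main__":
    # Three raters scoring three signals for a single category.
    w = kendalls_w([[0.1, 0.5, 0.9], [0.2, 0.6, 0.8], [-0.5, 0.0, 0.7]])
    # Two signals, two categories, three raters in full agreement.
    k = fleiss_kappa([[3, 0], [0, 3]])
    # Two raters scoring one signal against the four categories.
    probs, label = soft_label([[1, -1, -1, -1], [1, -1, 0, -1]])
    print(w, k, probs, label)
```

Fleiss' Kappa operates on discrete votes, so continuous scores would first need to be reduced to one category per rater and signal, for example by taking each rater's highest-scoring category. That reduction is likewise an assumption here. When all raters agree perfectly, both statistics equal 1.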