Fluorescent Neuronal Cells dataset - part III

Benchmark metrics for evaluation

Dec 13, 2022

In the third and last article of this series, we examine a list of task-specific metrics that are well suited to the analysis of Fluorescent Neuronal Cells (FNC) data.

If you missed the first parts, check them out for more details on i) how the data were gathered and what they represent:

Fluorescent Neuronal Cells dataset – part I

and ii) the specific challenges associated with FNC data:

Fluorescent Neuronal Cells dataset – part II

Model evaluation and performance assessment are critical steps in data analysis pipelines. Of course, there are various approaches available for this purpose.

Given this variety, the key aspect to remember is that each strategy emphasizes different capabilities of the model. Thus, the performance may vary substantially depending on the reference metrics.

For this reason, we must choose the evaluation plan wisely, so that it reflects the final use of our model.

In the following, we discuss a few metrics suitable for the FNC data. Specifically, we consider 3 scenarios depending on the learning task: semantic segmentation, object detection and object counting.

Segmentation metrics

For semantic segmentation, we can adopt standard metrics such as the Dice coefficient or the Mean Intersection over Union.
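As a rough illustration, both indicators can be computed from a pair of binary masks (cell vs. background). This is a minimal sketch, not the code of the original paper: the function names and the small `eps` smoothing term are assumptions made for this example. The mean version of IoU is then typically obtained by averaging over classes and/or images.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    # eps (an assumption of this sketch) avoids 0/0 when both masks are empty
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)


def iou(pred_mask, true_mask, eps=1e-7):
    """IoU = |A ∩ B| / |A ∪ B| for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return (intersection + eps) / (union + eps)


# Toy 2x3 masks: one shared pixel, two foreground pixels in each mask
pred = [[1, 1, 0], [0, 0, 0]]
true = [[1, 0, 0], [1, 0, 0]]
print(dice_coefficient(pred, true))  # close to 0.5
print(iou(pred, true))               # close to 1/3
```

Note that Dice is always at least as large as IoU on the same pair of masks, so the two numbers should not be compared with each other directly.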

However, both may be skewed by the subjective recognition of borderline cells and by potential inaccuracies in the annotations. Hence, we need to take this into account when interpreting such indicators.

A primary source of noise comes from the annotation procedure. Indeed, the ground-truth labels were produced with a semi-automatic approach involving adaptive thresholding and manual annotation. The former generates masks having jagged cell contours, while the latter presents objects with smoother borders.

As a consequence, we may observe minor repeated errors in the segmentation of borders even when the bulk of cells is correctly recognized.

Hence, the indicator values alone are insufficient for a faithful assessment. Instead, a thorough evaluation needs to look at the bigger picture and must be tailored to the end goal of the analysis.

In practice, the suggestion is to chase higher performance when the target is precise segmentation. On the contrary, we may relax the requirements when the ultimate interest is more in identifying the objects.

Detection metrics

Regarding object detection metrics, commonly used indicators such as F1 score, precision and recall can be adopted. The key element to determine is how to define true positives, false positives, and false negatives. This definition must be tailored to the specific characteristics of our data.
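For reference, here is a minimal sketch of how the three indicators follow from the counts of true positives (TP), false positives (FP) and false negatives (FN). The function name and the choice of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, F1) from TP, FP and FN counts."""
    # Precision: fraction of detected objects that are real cells
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: fraction of real cells that were detected
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1: harmonic mean of precision and recall
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Example: 8 matched cells, 2 spurious detections, 4 missed cells
precision, recall, f1 = detection_scores(tp=8, fp=2, fn=4)
print(precision, recall, f1)
```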

In the case of FNC, a dedicated algorithm was designed. This allows reasonable flexibility in the association between predicted and target objects.

Specifically, each predicted object is compared to all cells in the corresponding ground-truth label and uniquely linked with the closest one. If the distance between their centroids is smaller than the average cell diameter (50 pixels), the predicted element is considered a match and increases the true positive (TP) count.

At the end of this procedure, all true objects without matches are considered false negatives (FN). Likewise, the remaining detected items not associated with any target are considered false positives (FP).
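The matching procedure described above can be sketched as follows. This is one possible reading, not the original implementation: it assumes that centroids are given as (row, column) coordinates, that predictions are processed in the order they are listed, and that "uniquely linked" means each ground-truth cell can be matched at most once. The function name `match_detections` is hypothetical.

```python
import numpy as np


def match_detections(pred_centroids, true_centroids, max_dist=50.0):
    """Greedy centroid matching; returns (TP, FP, FN).

    max_dist defaults to the 50-pixel average cell diameter.
    """
    pred = np.asarray(pred_centroids, dtype=float).reshape(-1, 2)
    true = np.asarray(true_centroids, dtype=float).reshape(-1, 2)
    available = np.ones(len(true), dtype=bool)  # targets not matched yet

    tp = 0
    for p in pred:
        if not available.any():
            break  # no targets left: remaining predictions are FP
        dists = np.linalg.norm(true - p, axis=1)
        dists[~available] = np.inf  # ignore targets already linked
        nearest = int(np.argmin(dists))
        if dists[nearest] < max_dist:
            available[nearest] = False
            tp += 1

    fp = len(pred) - tp  # detections without a target
    fn = len(true) - tp  # targets without a detection
    return tp, fp, fn


pred = [(10, 10), (100, 100), (300, 300)]
true = [(12, 11), (130, 100), (500, 500)]
print(match_detections(pred, true))  # (2, 1, 1)
```

A greedy pass like this depends on the order of the predictions when cells are crowded. An optimal one-to-one assignment is a possible alternative, at the cost of a more involved implementation.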

For detection metrics, we do not encounter the same flaws described before for segmentation indicators.

Nonetheless, the presence of borderline cells makes our assessment vulnerable to the subjectivity of some annotations. In such cases, the disagreement between target and predicted objects typically lies within the limits of subjective operator interpretation.

However, this consistency is not captured by the metrics. Hence, we observe lower performance although the results are perfectly compatible with human judgment.

In summary, we can look at all indicators jointly for a comprehensive understanding of the strengths and weaknesses of our model.

Counting metrics

There are several alternatives to assess a model’s counting ability, each with pros and cons. The suggested strategy is to leverage different indicators together to evaluate the results from multiple complementary angles.

One way is to simply consider the discrepancy between the number of cells in ground-truth masks and predicted ones. For example, we can consider the Absolute Error to get an idea of the actual distance between target and predicted counts.

However, a given margin indicates a more or less severe error depending on the total number of target cells. For this reason, we can add the Percentage Error as an additional evaluation element. In addition, this provides information on whether we are over-/under-estimating the counts.
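A minimal sketch of the two quantities, computed per image, is given below. The sign convention (positive means over-estimation) and the choice of returning NaN for images with no target cells are assumptions of this example, not necessarily those of the original paper.

```python
import numpy as np


def counting_errors(true_counts, pred_counts):
    """Return per-image (absolute error, signed percentage error)."""
    true = np.asarray(true_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)

    abs_err = np.abs(pred - true)

    # Percentage error is undefined when the target count is zero
    pct_err = np.full_like(true, np.nan)
    nonzero = true != 0
    pct_err[nonzero] = 100.0 * (pred[nonzero] - true[nonzero]) / true[nonzero]
    return abs_err, pct_err


# Same 2-cell margin, very different severity: +20% vs. -2%
abs_err, pct_err = counting_errors([10, 100], [12, 98])
print(abs_err)  # [2. 2.]
print(pct_err)  # [20. -2.]
```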

Although the above quantities are intuitive, they may hide poor performance when the distribution of the counts has low variability. Thus, we can complement the assessment by looking at the R² coefficient of determination. This can be read as the portion of variance explained by the model. Hence, it gives a sense of how well our model captures the variability of the phenomenon.
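To make this concrete, here is a small sketch of R² computed directly between target and predicted counts, that is, without fitting a regression line first. This choice, the function name, and the NaN returned for constant targets are assumptions of this example.

```python
import numpy as np


def r_squared(true_counts, pred_counts):
    """R² = 1 - SS_res / SS_tot between target and predicted counts."""
    true = np.asarray(true_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    ss_res = np.sum((true - pred) ** 2)         # unexplained variation
    ss_tot = np.sum((true - true.mean()) ** 2)  # total variation of targets
    if ss_tot == 0:
        return float("nan")  # undefined when all target counts are equal
    return 1.0 - ss_res / ss_tot


# Every image is off by just one cell, yet the targets barely vary,
# so the predictions explain none of their variability.
print(r_squared([10, 11, 9, 10], [11, 10, 10, 9]))  # -1.0

# Same one-cell errors on widely varying counts give an R² close to 1.
print(r_squared([5, 50, 100], [6, 49, 101]))
```

A negative value means the predictions do worse than always returning the mean target count, which the absolute error alone would not reveal.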

All in all, the suggestion is to look at the three indicators jointly to have a more comprehensive understanding of the strengths and weaknesses of our model.

In this article, we examined several benchmarks for evaluating models trained using the Fluorescent Neuronal Cells dataset.

Of course, the final choice depends on the specific requirements of your analysis. Also, bear in mind that raw metric values have limitations due to the noise inherent in the data and its annotations.

Now I really want to know your take!

Do you think this list is exhaustive? Can you think of better or complementary metrics?

Let me know in the comments!

_If you liked the topic, you can read a more detailed discussion in [1, 2]. Also, you can go ahead and download the dataset, experiment with the code of the original paper and play with the data yourself._