How deep learning could revolutionise the identification of leukocytes on blood smears
This article was written in collaboration with Mathieu Sarrat and Laleh Ravanbod.
In this article, we will do a little bit of biology and explore how deep learning can help to classify blood cells. The diagnosis of many pathologies, such as infectious diseases, leukaemia or other haematological disorders, relies on the classification of subtypes of white blood cells, a.k.a. leukocytes. Several biological techniques exist to identify leukocytes, but the microscopic examination of blood smears is often crucial for confirming a diagnosis. Cells are identified by different characteristics: their granules, the number of lobes in the nucleus, the shape of the nucleus, the colour of the cytoplasm.
This technique, however, is prone to error, time-consuming and requires experts, which is why computer-aided analysis of blood smears has been developed. Classical leukocyte classification pipelines involve segmentation of the cell from its surroundings, feature extraction and selection, followed by shallow machine learning classifiers. This type of workflow is extremely difficult to generalise from one laboratory to another because of the variety of staining, protocols and acquisition systems.
This is where deep learning can come in handy.
Classically, circulating blood cells are split into 5 major subtypes: platelets, red blood cells, granulocytes (basophils, neutrophils, eosinophils), monocytes, and lymphocytes.
Here, we tested 3 types of deep learning architecture to classify 11 classes of healthy blood cells: the 5 major subtypes and some of their progenitors (neutrophil progenitors):
- neutrophils (segmented) → SNE
- eosinophils → EO
- basophils → BA
- lymphocytes → LY
- monocytes → MO
- platelets → PLATELET
- erythroblasts → ERB
- immature (metamyelocytes, myelocytes, promyelocytes) and band neutrophils → MMY, MY, PMY and BNE
II - EDA
a. Data distribution
We used 3 publicly available datasets: from the Core Laboratory at the Hospital Clinic of Barcelona, from Munich University Hospital, and from Razi Hospital in Rasht, Gholhak Laboratory, Shahr-e-Qods Laboratory and Takht-e Tavous Laboratory in Tehran (Raabin-WBC dataset). Combining those datasets resulted in about 50 000 images of single cells from blood smears stained with either MGG or Giemsa and acquired on 3 different systems. The majority of patients were characterised as healthy, apart from a few patients diagnosed with leukaemia.
Let's do some EDA to check our data. As you can see on the barplot below, segmented neutrophils (the most advanced stage of neutrophil maturation, SNE for short) are overrepresented in the Munich dataset. The lymphocyte population is in the majority in the Raabin and the Munich datasets. The classes are more balanced in the Barcelona dataset. Platelets are only present in the latter dataset.
b. UMAP and dimension reduction
We chose to use UMAP for dimension reduction as it preserves both local and most of the global structure in the data. Looking at the data projected onto 3 main components, we can see that the variability is principally explained by the origin of the data, which could be due to differences in staining and variations in luminosity (see our Streamlit app). The other main variable driving clustering appears to be the size of the cells.
Our data are diverse and could constitute an interesting dataset for building classification models.
III - Modelling
To classify images, we tested 3 different architectures, which are described in more detail in our Streamlit app: VGG16 coupled to an SVM, VGG19 and ViT. Our baseline model was a Logistic Regression, which showed an accuracy of about 65-70%.
a. Transfer Learning and fine tuning
We used transfer learning for all neural networks: we started from models pre-trained on ImageNet and retrained more or fewer layers, as detailed for each model later on.
b. Pre-processing and regularisation
Our dataset was split into train, validation and test sets (80%, 10%, 10%). Data augmentation was used to reduce over-fitting and improve the ability to generalise. Training data flow through the following pipeline: we resized and reshaped the pictures, then applied some augmentation (rotation, axial symmetry, shear).
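The 80/10/10 split can be sketched in plain Python. This is only a minimal illustration: the per-class (stratified) shuffling, the `stratified_split` helper and the toy label counts are assumptions, not a description of the code actually used for the project.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) lists of sample indices, split class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Whatever is left after train and validation goes to the test set.
        test += indices[n_train + n_val:]
    return train, val, test


# Toy labels with made-up counts: 100 lymphocytes and 20 basophils.
labels = ["LY"] * 100 + ["BA"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting each class separately keeps rare classes such as BA present in all three sets, which matters when some classes are heavily under-represented.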
Moreover, we implemented the following during the training of our models:
- Early stopping callback with a pre-defined patience: to prevent over-fitting.
- Reduce learning rate on plateau: to accelerate training and convergence.
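To make these two mechanisms concrete, here is a simplified, framework-free sketch of their logic. It replays a list of validation losses and applies both rules; the function name `run_schedule`, the patience values, the reduction factor and the loss values are all illustrative assumptions, and real callbacks (for example in Keras) offer more options such as a minimum delta or a cooldown.

```python
def run_schedule(val_losses, lr=1e-3, stop_patience=5, lr_patience=2, factor=0.5):
    """Replay validation losses and return (epochs_run, final_learning_rate).

    Simplified on purpose: both rules watch the same best loss.
    """
    best = float("inf")
    epochs_since_best = 0
    epochs_since_lr_cut = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember it and reset both counters.
            best = loss
            epochs_since_best = 0
            epochs_since_lr_cut = 0
            continue
        epochs_since_best += 1
        epochs_since_lr_cut += 1
        if epochs_since_lr_cut >= lr_patience:
            lr *= factor  # reduce learning rate on plateau
            epochs_since_lr_cut = 0
        if epochs_since_best >= stop_patience:
            return epoch, lr  # early stopping
    return len(val_losses), lr


# Made-up validation losses that stop improving after epoch 3.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.73, 0.74, 0.9, 0.9]
epochs_run, final_lr = run_schedule(losses)
print(epochs_run, final_lr)
```

In this toy run the learning rate is halved twice while the loss stagnates, and training stops after 8 of the 10 available epochs.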
Finally, we used class weights during training: errors committed on low-population classes (e.g. PMY or BA) are penalised more heavily, with each class weighted by (1/class_count) * (total_count/2).
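The weighting formula can be written out in a few lines. The sketch below applies it to made-up class counts; the helper name `class_weights` and its `scale` parameter (the 2 in the formula) are illustrative assumptions.

```python
from collections import Counter


def class_weights(labels, scale=2):
    """Weight each class by (1 / class_count) * (total_count / scale)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (1 / n) * (total / scale) for cls, n in counts.items()}


# Made-up, imbalanced label counts.
labels = ["LY"] * 80 + ["SNE"] * 15 + ["BA"] * 5
weights = class_weights(labels)
for cls, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(cls, round(weight, 3))
```

The rarer the class, the larger its weight, so a misclassified BA picture contributes more to the loss than a misclassified LY picture.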
c. The models
1) VGG16 with SVM on segmented cells
Using VGG16 coupled with SVM, our objective was to study the effect of background filtering on the precision of the model. We used a VGG16 network for feature extraction connected to a Support Vector Machine for classification. The procedure of background filtering or cell segmentation, shown in the following figure, was based on the assumption that the cell is situated near the centre of the image and is purpler than other parts of the image.
The base model was VGG16 without its 3 fully connected top layers. It was completed with the following layers: GlobalAveragePooling2D, a Dense layer (1024), a Dropout layer with a 0.2 rate, a Dense layer (512), followed by another Dropout and finally a Dense classification layer. Training is done in 3 steps: first all the layers are trained, then all layers but the last 4 are trained, and finally an intermediate model is defined that includes the base model, the GlobalAveragePooling2D layer and the Dense layer (1024). The outputs of this intermediate model are computed for the training data and used to train an SVM.
F1 scores were not different before and after filtering. In other words, the model extracts the relevant information about a cell regardless of its background.
2) VGG19
For VGG19, the last convolutional block was fine-tuned and the classification block layers were replaced with our own custom block. VGG19-specific preprocessing was applied, i.e. the inversion of the RGB channels (RGB to BGR) and normalisation of the data.
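As an illustration of this preprocessing, the sketch below reproduces on plain Python lists what the standard Keras VGG preprocessing is documented to do: reorder the channels from RGB to BGR, then subtract the ImageNet mean of each channel, without further scaling. We assume here that this standard preprocessing is the one meant above; the helper names are hypothetical.

```python
# ImageNet channel means in B, G, R order, as used by the "caffe"-style
# preprocessing of the Keras VGG models.
IMAGENET_BGR_MEANS = (103.939, 116.779, 123.68)


def preprocess_pixel(rgb):
    """Convert one (R, G, B) pixel to zero-centred (B, G, R) values."""
    r, g, b = rgb
    bgr = (b, g, r)
    return tuple(value - mean for value, mean in zip(bgr, IMAGENET_BGR_MEANS))


def preprocess_image(image):
    """Apply the pixel transform to an H x W nested list of RGB pixels."""
    return [[preprocess_pixel(pixel) for pixel in row] for row in image]


# A 1 x 2 toy image: one pure red pixel and one mid-grey pixel.
toy_image = [[(255, 0, 0), (128, 128, 128)]]
processed = preprocess_image(toy_image)
print(processed[0][0])
```

In a real pipeline this step would be applied to whole image arrays by the framework's own preprocessing function rather than pixel by pixel.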
Performance looked good and we obtained a global accuracy of around 94% on test data, but it is important to remember that test data and training data come from the same sources (a mix from Barcelona, Munich and Raabin datasets).
3) Vision Transformer
Vision transformers (ViT) have shifted the field of computer vision. We chose to test a basic ViT-b16 from the ViT-Keras library. In ViT, images are cut into patches (16 x 16 pixels for the ViT-b16) that are flattened, linearly projected and combined with a positional embedding. These projections are then fed into transformer encoder layers followed by a multi-layer perceptron (MLP) acting a bit like a decoder. A final MLP head is responsible for the final classification.
For those models, we added label smoothing as an extra regularisation to try to take labelling errors into account. We also chose to test the Rectified Adam optimiser, which is reported to be less sensitive to the learning rate and to generalise better.
The model performs slightly worse than the VGG19, with a global accuracy of 92% on validation and test data.
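The patching step can be illustrated without any deep learning library. The sketch below cuts an image into non-overlapping 16 x 16 patches and flattens each one; the 224 x 224 input size and the `image_to_patches` helper are assumptions made for the example, and the projection and positional embedding are left out.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row, pixel by pixel, channel by channel.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Hypothetical 224 x 224 RGB input filled with zeros.
image = [[(0, 0, 0)] * 224 for _ in range(224)]
patches = image_to_patches(image, patch_size=16)
print(len(patches), len(patches[0]))
```

Under these assumptions the model receives a sequence of 196 patches, each a vector of 768 values, which plays the role that a sequence of word tokens plays in a text transformer.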
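Label smoothing replaces the hard one-hot target with a slightly softened one, so the model is not pushed to be fully certain about a label that may be wrong. A minimal sketch of the usual formulation follows; the smoothing factor of 0.1 is an assumption, not the value used in the project.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Return y * (1 - epsilon) + epsilon / K for a one-hot vector of K classes."""
    n_classes = len(one_hot)
    return [y * (1 - epsilon) + epsilon / n_classes for y in one_hot]


# One-hot target over 11 classes, with the true class at index 2.
target = [0] * 11
target[2] = 1
smoothed = smooth_labels(target)
print([round(value, 4) for value in smoothed])
```

The true class keeps most of the probability mass, while every other class receives a small equal share.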
IV - Results analysis
Here are the performances of the 3 models as F1-scores for each class and global accuracy:
a. Cell maturation and labelling
Most pictures are correctly classified (F1 > 0.95), except for the different kinds of neutrophils. The confusion matrix reveals that the 3 models muddle up the different kinds of neutrophils (mature SNE and immature PMY, MMY, MY, BNE), as you can see below (results obtained with VGG19), where the highest percentages of misclassified pictures have been circled in red.
But why? To understand, we need to invoke biology: PMY, MY, MMY, BNE and SNE are steps in the neutrophilic granulocyte maturation process. This process is a continuous one (e.g. the nucleus slowly evolves from a potato shape to a multi-lobed one), so it is plausible to find cells with features of two successive maturation stages. Since the model must choose a single class, classification errors follow.
For the same reasons, labellers sometimes struggle to reach agreement, and labelling errors are possible (pictures below):
Labelling such pictures is complex work and requires well-trained experts, who are not infallible.
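To make the link between the confusion matrix and the per-class F1 scores discussed in this section explicit, here is a small sketch that derives one from the other. The 3-class matrix is entirely made up and only mimics the kind of confusion seen between neighbouring neutrophil stages; `f1_per_class` is a hypothetical helper.

```python
def f1_per_class(confusion, class_names):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    scores = {}
    for k, name in enumerate(class_names):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # true class k, predicted as another
        fp = sum(row[k] for row in confusion) - tp  # another class, predicted as k
        denominator = 2 * tp + fp + fn
        scores[name] = 2 * tp / denominator if denominator else 0.0
    return scores


# Made-up counts: SNE and BNE are confused with each other, LY is not.
names = ["SNE", "BNE", "LY"]
confusion = [
    [80, 20, 0],
    [15, 35, 0],
    [0, 0, 50],
]
scores = f1_per_class(confusion, names)
accuracy = sum(confusion[k][k] for k in range(len(names))) / sum(map(sum, confusion))
print(scores, accuracy)
```

In this toy case the global accuracy stays above 80% even though the BNE class scores much lower, which is why per-class scores are worth reading alongside the global figure.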
b. Explainability
Now, we can go further and investigate what our models look at in a picture before assigning it to a class. With VGG-based models, we used Grad-CAM. With ViT, we used Attention Maps. These two techniques highlight on the picture the most important features for a given prediction.
We loaded an eosinophil (EO) from the Raabin dataset into VGG19 in our Streamlit app; here is a screenshot of the result:
The main characteristics of an EO are a segmented nucleus and pink granules in the cytoplasm. Grad-CAM reveals that the model searches for the pink granules when it tries to determine whether the picture is an EO. The ViT attention map seems to be slightly more focused on the nucleus, but also takes the granules into consideration. For other classes, Grad-CAM and Attention Maps are more complex to interpret, but both show that the models focus on the cell, at the centre of the picture, and not on the background of red blood cells.
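The core of Grad-CAM is a short computation: each feature map of the last convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped. The sketch below shows only this final step on toy values; in practice the activations and gradients come from the deep learning framework, and the helper name and numbers here are illustrative assumptions.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one Grad-CAM heatmap.

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]: gradient of the class score w.r.t. that value.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # One weight per feature map: the spatial average of its gradient.
    weights = [
        sum(sum(row) for row in grad) / (height * width)
        for grad in gradients
    ]
    heatmap = [[0.0] * width for _ in range(height)]
    for weight, feature_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * feature_map[i][j]
    # ReLU: keep only the regions that push the class score up.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Two toy 2 x 2 feature maps: the first supports the class, the second opposes it.
activations = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

The resulting low-resolution heatmap is then upsampled to the picture size and overlaid on it, which gives the kind of visualisation described above.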
V - Limitations: what happens outside the validation set.
We have tested pictures which do not come from the Raabin, Barcelona and Munich datasets. For example, VGG19 is able to accurately classify some pictures that are rather similar in shape and staining, like the four pictures in the figure above, obtained on Google Images.
However, the model fails badly (global accuracy around 8%) on some datasets which are very different from the training one, like WBC_segmentation, coming from Jiangxi Tecom Science Corporation, China.
Those images were taken with a different microscope optical system, the blood smears were processed with a lab-specific staining and, above all, the picture resolution is low compared to the training dataset. Thus, further progress can be made.
VI - Conclusion
Our models do well in classifying 11 classes of blood cell pictures from 3 different datasets, but many improvements are possible. Moreover, we have seen that good global metrics can hide important problems, e.g. with the model's ability to generalise to quite different datasets. Here are some potential improvements:
a. Increase diversity: augment and vary.
- More diverse sources or types of pictures: including images coming from other institutions (different acquisition systems, different staining, luminosity, etc.) will allow for a more balanced dataset (fewer SNE and LY, more immature cells) and a better ability for the model to generalise to new data.
- Augment our data: we could try to mimic real-life histology staining, alter actual staining or use GANs (Generative Adversarial Networks) or VAEs (Variational Auto-Encoders) to produce new pictures.
- We could then consider abandoning the transfer learning approach and fully training our models on blood cell pictures.
b. Improve the labelling: use the biological definition.
We need to train the model with pictures that are confidently labelled. One option is a cross-validation process between independent expert pathologists, as was done in the Raabin project, but this is time-consuming and resource-intensive, so self-supervised or semi-supervised learning could be more reasonable alternatives.
Another option would be the use of transcription factors and granule proteins as labels. This could be done by co-staining with antibodies or using flow cytometry labelling prior to staining.
c. Object detection: for full use.
We have only worked on one step of the full process: we use segmented pictures, with only one cell at the centre. A possible extension of this work could involve object (= blood cell) detection (e.g. with YOLOv5) on a large-scale picture of a blood smear, in order to produce the kind of pictures we use in this app.
We hope you enjoyed reading about our project. If you would like to get in touch, don't hesitate!