Detection and Classification of Immature Leukocytes for Diagnosis of Acute Myeloid Leukemia
Abstract:
Acute Myeloid Leukemia (AML) is a blood cancer that must be detected early for effective treatment; otherwise, it can be fatal within weeks or months. The current AML diagnostic method is a time-consuming, labor-intensive microscopic evaluation of a peripheral blood smear with an error rate of 30% to 40%. During the manual examination, the clinician must classify and count each leukocyte in the blood smear to determine whether there is an abnormal number of immature leukocytes, which are a sign of blood malignancies. Herein, I present a machine learning model to automatically detect and classify immature leukocytes for efficient diagnosis of AML. Images of leukocytes in AML patients and healthy controls were obtained from a publicly available dataset in The Cancer Imaging Archive. Image format conversion, multi-Otsu thresholding, and other image processing operations were used for segmentation of the nucleus and cytoplasm. From each image, 16 cytomorphological features were extracted, two of which are novel nucleus color features proposed in this study. A Random Forest algorithm was selected to overcome the data imbalance across classes of leukocytes and was trained for the detection and classification of immature leukocytes. The model achieved 92.99% accuracy for detection and 93.45% accuracy for classification of immature leukocytes into four types in the testing set (images not previously seen by the model). Precision values for each class were above 65%, exhibiting a significant improvement over the current state-of-the-art models, which have obtained precision values below 65% for multiple classes. Based on calculations of Gini Importance, the nucleus to cytoplasm area ratio was a discriminative feature for both detection and classification, while the two proposed features were shown to be significant for classification.
This is the first study that utilizes Gini Importance to mathematically calculate the importance of a variety of cytomorphological features for the classification of leukocytes. The proposed model can be used as an effective, automatic support tool for the diagnosis of AML. Future studies will also improve the classification between similar classes of leukocytes, such as promyelocytes and myeloblasts, and build on this project by evaluating additional cytomorphological features for the classification of leukocytes.
Bibliography/Citations:
Please see my published, peer-reviewed research paper (https://doi.org/10.3390/bioengineering7040120) for its citations/references.
Bibliography for Research Plan (journal articles read prior to and during the conceptualization of project):
Acute myeloid leukemia—Cancer stat facts. (n.d.). SEER. Retrieved July 13, 2020, from https://seer.cancer.gov/statfacts/html/amyl.html
Ahmed, N., Yigit, A., Isik, Z., & Alpkocak, A. (2019). Identification of leukemia subtypes from microscopic images using convolutional neural network. Diagnostics, 9(3). https://doi.org/10.3390/diagnostics9030104
American society of hematology. (n.d.). Retrieved July 13, 2020, from https://www.hematology.org:443/
Arber, D. A., Orazi, A., Hasserjian, R., Thiele, J., Borowitz, M. J., Le Beau, M. M., Bloomfield, C. D., Cazzola, M., & Vardiman, J. W. (2016). The 2016 revision to the World Health Organization classification of myeloid neoplasms and acute leukemia. Blood, 127(20), 2391–2405. https://doi.org/10.1182/blood-2016-03-643544
Bigorra, L., Merino, A., Alférez, S., & Rodellar, J. (2017). Feature analysis and automatic identification of leukemic lineage blast cells and reactive lymphoid cells from peripheral blood cell images. Journal of Clinical Laboratory Analysis, 31(2), e22024. https://doi.org/10.1002/jcla.22024
Ghane, N., Vard, A., Talebi, A., & Nematollahy, P. (2019). Classification of chronic myeloid leukemia cell subtypes based on microscopic image analysis. EXCLI Journal, 18, 382–404. https://doi.org/10.17179/excli2019-1292
Jabbar, H. K., & Khan, R. Z. (2014). Methods to avoid over-fitting and under-fitting in supervised machine learning(Comparative study). Computer Science, Communication and Instrumentation Devices, 163–172. https://doi.org/10.3850/978-981-09-5247-1_017
Jagadev, P., & Virani, H. G. (2017). Detection of leukemia and its types using image processing and machine learning. 2017 International Conference on Trends in Electronics and Informatics (ICEI), 522–526. https://doi.org/10.1109/ICOEI.2017.8300983
Kazemi, F., Najafabadi, T. A., & Araabi, B. N. (2016). Automatic recognition of acute myelogenous leukemia in blood microscopic images using k-means clustering and support vector machine. Journal of Medical Signals and Sensors, 6(3), 183–193.
Matek, C., Schwarz, S., Spiekermann, K., & Marr, C. (2019). Human-level recognition of blast cells in acute myeloid leukaemia with convolutional neural networks. Nature Machine Intelligence, 1(11), 538–544. https://doi.org/10.1038/s42256-019-0101-9
Perone, C. S., & Cohen-Adad, J. (2019). Promises and limitations of deep learning for medical image segmentation. Journal of Medical Artificial Intelligence, 2(0). http://jmai.amegroups.com/article/view/4659
Rajbongshi, N., Bora, K., Nath, D. C., Das, A. K., & Mahanta, L. B. (2018). Analysis of morphological features of benign and malignant breast cell extracted from fnac microscopic image using the pearsonian system of curves. Journal of Cytology, 35(2), 99–104. https://doi.org/10.4103/JOC.JOC_198_16
Rizwan I Haque, I., & Neubert, J. (2020). Deep learning approaches to biomedical image segmentation. Informatics in Medicine Unlocked, 18, 100297. https://doi.org/10.1016/j.imu.2020.100297
Sarica, A., Cerasa, A., & Quattrone, A. (2017). Random forest algorithm for the classification of neuroimaging data in alzheimer’s disease: A systematic review. Frontiers in Aging Neuroscience, 9. https://doi.org/10.3389/fnagi.2017.00329
Sathya, R., & Abraham, A. (2013). Comparison of supervised and unsupervised learning algorithms for pattern classification. International Journal of Advanced Research in Artificial Intelligence, 2(2). https://doi.org/10.14569/IJARAI.2013.020206
Scholl, I., Aach, T., Deserno, T. M., & Kuhlen, T. (2011). Challenges of medical image processing. Computer Science - Research and Development, 26(1–2), 5–13. https://doi.org/10.1007/s00450-010-0146-9
Shafique, S., & Tehsin, S. (2018a). Acute lymphoblastic leukemia detection and classification of its subtypes using pretrained deep convolutional neural networks. Technology in Cancer Research & Treatment, 17, 1533033818802789. https://doi.org/10.1177/1533033818802789
Shafique, S., & Tehsin, S. (2018b, February 28). Computer-aided diagnosis of acute lymphoblastic leukaemia [Review Article]. Computational and Mathematical Methods in Medicine. https://doi.org/10.1155/2018/6125289
Shinde, P. P., & Shah, S. (2018). A review of machine learning and deep learning applications. 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), 1–6. https://doi.org/10.1109/ICCUBEA.2018.8697857
Wiharto, W., Suryani, E., & Putra, Y. R. (2018). Classification of blast cell type on acute myeloid leukemia (Aml) based on image morphology of white blood cells. TELKOMNIKA (Telecommunication Computing Electronics and Control), 17(2), 645. https://doi.org/10.12928/telkomnika.v17i2.8666
Xu, Y., & Goodacre, R. (2018). On splitting training and validation set: A comparative study of cross-validation, bootstrap and systematic sampling for estimating the generalization performance of supervised learning. Journal of Analysis and Testing, 2(3), 249–262. https://doi.org/10.1007/s41664-018-0068-2
Additional Project Information
Research Plan:
Link to Research Plan (Google Document): https://docs.google.com/document/d/1SIC_RtRIEPlGwEy2w2lpNLVGKAYlhAKP-Yea4NBgtf4/edit?usp=sharing
Research Plan/Project Summary
Initial plan completed on 7/15; Addendum completed 8/1 (see end of document)
Project Name: Detection and Classification of Immature Leukocytes for Diagnosis of Acute Myeloid Leukemia
Student Researcher: Satvik Dasariraju
Adult Sponsor: Dr. Daniel Concepcion (Lawrenceville School)
Rationale:
Leukemia is an aggressive blood cancer that causes the growth of malignant white blood cells that damage the blood and bone marrow. The immature white blood cells impair the functions of the bone marrow, including the production of red blood cells and platelets, thus making the immune system vulnerable. Acute myeloid leukemia (AML) is the deadliest of the four subtypes of leukemia, with a five-year survival rate of 28.7% (American Society of Hematology, n.d.). To detect and confirm the presence of AML, microscopic examination of leukocytes from peripheral blood smears is performed. Manual classification and counting of white blood cells can only be performed by trained medical professionals and is time-consuming. Moreover, classification of leukocytes is prone to variation between medical examiners. Based on these limitations of manual detection and classification, an automated system is required (Matek et al., 2019).
Previous studies have achieved high accuracy on the classification of acute lymphocytic leukemia (ALL) into its subtypes (Shafique & Tehsin, 2018b), but AML has received less attention due to its high number of subtypes. Past research on AML classification based on the French-American-British (FAB) system has accurately classified up to three subtypes (Kazemi et al., 2016; Wiharto et al., 2018; Setiawan et al., 2018), while studies involving AML cell subtypes resulted in low precision for multi-class prediction despite outstanding performance at distinguishing between immature (found in pathological conditions) and mature cells (Matek et al., 2019). The classification of AML cells holds great importance for the detection of the cancer because a complete white blood cell count is necessary to confirm leukemia (Shafique & Tehsin, 2018b). Treatment and prognosis of the cancer also depend on accurate classification of leukocytes in the peripheral blood, displaying the necessity for accurate multi-class categorization. An algorithm capable of detection and classification of white blood cells can be used in a clinical setting to reduce the burden on medical experts and aid doctors in their diagnosis of AML.
Research Questions:
Note: immature leukocytes (erythroblasts, monoblasts, myeloblasts, metamyelocytes, myelocytes, and promyelocytes) are typically found in the peripheral blood stream only during pathological conditions.
- What is the capability of a Random Forest classifier (a type of machine learning classifier) to distinguish between classes in an unbalanced dataset?
- What is the capability of a machine learning algorithm to distinguish between subtypes with inter- and intra-class variability?
- What morphological features are most important in discriminating between immature and mature leukocytes?
- What morphological features are most important for classification of immature leukocytes?
- What new geometric calculations, beyond the standard features, are required for accurate classification of leukocytes?
Hypothesis:
If a Random Forest algorithm (a type of machine learning algorithm) is employed for the detection and classification of AML cell subtypes, then the performance of the model, which will be calculated with standard metrics including accuracy, precision, and AUC-ROC, will be superior to previous approaches due to the nature of a Random Forest to not overfit to data and perform well with unbalanced data (Sarica et al., 2017).
Engineering Goals:
- Main Experiment (Objective 1): To develop a Random Forest algorithm capable of detecting immature leukocytes in AML cell images and classifying the immature leukocytes by myeloid cell type
- Objective 2: To establish and display the 5 most important morphological features for the classification of leukocytes after the Random Forest algorithm is trained
- Objective 3: To propose, calculate, and prove the significance of several new geometric features for the classification of leukocytes (if possible). Due to the unique maturation of leukocytes, previous research has displayed the significance of new geometric features for classification of leukemia (Ghane et al., 2019).
- Objective 4: To create a system output that is better suited for a clinical environment by displaying the detection and classification of immature leukocytes, mature prognosis based on the classification, and the rationale for diagnosis (Note: see Addendum for revision to this objective)
Expected Outcomes:
Main Experiment (Objective 1): A Random Forest algorithm capable of…
- detecting immature leukocytes in AML with AUC-ROC above 0.9
- classifying immature leukocytes in AML with a global precision above 0.8
Objective 2: Identification of the 5 most important morphological features for the detection and classification of immature leukocytes in AML
Objective 3: A proposal of several new geometric features calculated specifically for leukocytes and proof that the features are significant
Objective 4: An output with the predicted classification and typical prognosis based on the classification
(Note: see Addendum for revision to this objective)
Materials:
- Personal computer (PC)
- The publicly available AML dataset at TCIA that was provided by Matek et al.: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=61080958
(Note: all the data is anonymous and de-identified)
- Jupyter notebooks for documentation and coding in the Python programming language
- Various open source Python packages including skimage, sklearn, numpy, and pandas.
Procedures:
Preprocessing and Segmentation:
- Download the images, abbreviations, and annotations from the dataset.
- Read in the images and start segmentation to isolate the leukocyte in each image.
- Create a features matrix based on geometric calculations of the morphological features of the leukocytes; the labels will be immature and mature.
- Divide the features matrix into a training set, validation set for optimization of the model, and testing set for unbiased evaluation of the model’s performance on new data.
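The abstract names multi-Otsu thresholding as one of the segmentation operations. The block below is a minimal pure-Python sketch of that idea for three classes, not the project's actual code: it searches every pair of thresholds and keeps the pair that maximizes the between-class variance. The function name `multi_otsu_two_thresholds` and the toy pixel values are hypothetical; in practice a library routine such as `skimage.filters.threshold_multiotsu` would normally be used instead.

```python
def multi_otsu_two_thresholds(pixels, levels=256):
    """Return (t1, t2) splitting integer intensities into three classes.

    Classes are [0, t1], (t1, t2], and (t2, levels - 1]. The pair is found
    by exhaustive search and maximizes the between-class variance.
    """
    n = len(pixels)
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    # Prefix sums of probability mass and of intensity-weighted mass.
    mass = [0.0] * (levels + 1)
    moment = [0.0] * (levels + 1)
    for i in range(levels):
        mass[i + 1] = mass[i] + hist[i] / n
        moment[i + 1] = moment[i] + i * hist[i] / n

    def class_score(lo, hi):
        # Contribution of intensities lo..hi (inclusive): mass * mean**2.
        w = mass[hi + 1] - mass[lo]
        if w == 0:
            return None  # an empty class makes this split invalid
        m = moment[hi + 1] - moment[lo]
        return m * m / w

    best_score, best_pair = -1.0, None
    for t1 in range(levels - 2):
        low = class_score(0, t1)
        if low is None:
            continue
        for t2 in range(t1 + 1, levels - 1):
            mid = class_score(t1 + 1, t2)
            high = class_score(t2 + 1, levels - 1)
            if mid is None or high is None:
                continue
            # Maximizing this sum is equivalent to maximizing the
            # between-class variance, because the global mean is constant.
            score = low + mid + high
            if score > best_score:
                best_score, best_pair = score, (t1, t2)
    return best_pair


# Toy "image" with three intensity populations (hypothetical values).
pixels = [10] * 50 + [120] * 30 + [240] * 20
t1, t2 = multi_otsu_two_thresholds(pixels)
labels = [0 if p <= t1 else 1 if p <= t2 else 2 for p in pixels]
```

Which of the three resulting classes corresponds to nucleus, cytoplasm, or background depends on the stain and image format, so that mapping is left open here.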
Binary Classification Task (Detection of Immature Leukocytes):
- Train the Random Forest algorithm with the training set.
- Test the default-parameter Random Forest with the testing set.
- Evaluate the performance of the model with the following set of metrics: accuracy, precision, recall/sensitivity, specificity, and ROC AUC.
- Plot the model’s performance and the baseline performance on a receiver operating characteristic (ROC) curve for visualization.
- Plot the model’s performance on a confusion matrix for visualization of predictions.
- Output the most important features for detection of immature leukocytes
- Optimize the hyperparameters of the Random Forest by randomly testing the performance of combinations of parameters on the training and validation sets and select the best combination of parameters.
- Manually tune any parameters that can clearly be improved.
- Test the optimized Random Forest with the testing set.
- Evaluate the performance of the optimized model with the following set of metrics: accuracy, precision, recall/sensitivity, specificity, and ROC AUC.
- Plot the optimized model’s performance and the baseline performance on a receiver operating characteristic (ROC) graph for visualization.
- Plot the optimized model’s performance on a confusion matrix for visualization of predictions.
- Output the most important features for classification.
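The binary classification steps above can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions rather than the project's code: the data come from `make_classification` as a synthetic stand-in for the 16-feature matrix, the hyperparameter grid is hypothetical, and the search uses 5-fold cross-validation on the training data in place of a separate validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the feature matrix (assumption).
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=6,
    weights=[0.8, 0.2], random_state=0,
)

# Hold out a stratified test set that the model never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Default-parameter Random Forest, evaluated on the test set.
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])

# Random search over a small, hypothetical hyperparameter space.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5, cv=5, scoring="accuracy", random_state=0,
)
search.fit(X_train, y_train)

# Optimized model, evaluated on the same test set for a fair comparison.
optimized = search.best_estimator_
optimized_auc = roc_auc_score(y_test, optimized.predict_proba(X_test)[:, 1])

# Rank features by impurity-based importance and keep the top five indices.
importances = optimized.feature_importances_
top5 = sorted(range(X.shape[1]), key=lambda i: importances[i], reverse=True)[:5]
```

Fixing `random_state` everywhere keeps the default-parameter and optimized models comparable on identical splits, which matches the control variables listed in the Data Analysis section.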
Immature Leukocyte Classification Task:
- Edit the features matrix; the labels will be the type of immature leukocyte this time (4 types will be used).
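- Repeat steps 4-17 (from dividing the features matrix through outputting the most important features for classification) with the new features matrix, without ROC curves.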
- Create the system’s output with diagnosis, classification, mature prognosis, and rationale.
Proposal of New Geometric Features (if possible):
- Propose 1-3 new geometric features specific to the morphology of leukocytes
- Display significance/insignificance of geometric features by calculating average of each feature for the different classes and calculating standard error (if 2SE ranges don’t intersect, the features are significant for discrimination).
Data Analysis:
Due to the multicomponent nature of the study, there will be numerous sets of variables and data analyses.
Main Experiment/Objective 1 Part A: Binary Classification (Immature vs. Mature)
Independent Variable: Selection and optimization of model (the default-parameter Random Forest model and the optimized Random Forest model will be compared with each other and the models of previous research)
Dependent Variable: Performance of model, measured by ROC-AUC
Control Variables:
- The training, validation, and testing sets will be the same for comparison of the default-parameter and optimized model.
- The same computer will be used for all models in this project.
- The Random Forest algorithm will be the same for all models in this project (from sklearn).
Main Experiment/Objective 1 Part B: Immature Leukocyte Classification
Independent Variable: Selection and optimization of model (the default-parameter Random Forest model and the optimized Random Forest model will be compared with each other and the models of previous research)
Dependent Variable: Performance of model, measured by precision for each class
Control Variables:
- The training, validation, and testing sets will be the same for comparison of the default-parameter and optimized model.
- The same computer will be used for all models in this project.
- The Random Forest algorithm will be the same for all models in this project (from sklearn).
Performance Metrics for Main Experiment/Objective 1:
- Accuracy- defined as the probability that an image is correctly identified (may be misleading):
(TP + TN)/(TP + FP + TN + FN)
- Precision- defined as the probability that an image claimed to be positive is truly positive:
(TP)/(TP + FP)
- Sensitivity/Recall- defined as the probability that a positive image was identified as positive:
(TP)/(TP + FN)
- Specificity- defined as the probability that a negative image was identified as negative (only for binary classification):
(TN)/(TN + FP)
- AUC-ROC: area under the curve of a receiver operating characteristic graph (only for binary classification)
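As a worked example of the four count-based formulas above, the following sketch computes them from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; the function does not guard against a zero denominator.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the count-based metrics defined above from a confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),   # binary classification only
    }


# Hypothetical counts for 100 images.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, accuracy is 85/100 and precision is 40/50, while recall (40/45) and specificity (45/55) differ because they are normalized by the true class sizes rather than the predicted ones.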
Visualizations and Graphs for Main Experiment/Objective 1:
- The ROC curve of each model (default-parameter and optimized algorithms) will be plotted against a baseline model (a baseline model is one that predicts the positive class for all images).
- A confusion matrix will be plotted for each model to provide a visual representation of the predictions made and the true labels.
Objective 2: Identification of 5 Most Important Features for Detection and Classification
Independent Variable: Morphological feature
Dependent Variable: Importance of feature (see below for definition)
Control Variables:
- The same segmentation algorithm will be used prior to calculation of features.
- Each feature will have the same number of data points in the feature matrix.
- Each feature’s importance will be calculated using the same quantitative formula.
- The same computer will be used for the Random Forest model and all corresponding calculations.
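Features will be ranked by Gini Importance (also called Mean Decrease in Impurity), which is defined as the total decrease in Gini Impurity produced by splits on the feature, averaged over the trees in the Random Forest. Gini Impurity is the probability that a randomly chosen sample at a given node would be categorized incorrectly if it were labeled at random according to the class distribution at that node.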
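To make the definition concrete, here is a small sketch of Gini Impurity and of the impurity decrease produced by a single split. The function names and toy labels are hypothetical; a Random Forest library accumulates decreases like this over every split on a feature and every tree to obtain the feature's importance.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k ** 2) over the class proportions p_k at a node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node with four immature and four mature cells (hypothetical).
parent = ["immature"] * 4 + ["mature"] * 4

# A split that separates the classes perfectly removes all impurity.
perfect = impurity_decrease(parent, ["immature"] * 4, ["mature"] * 4)

# A split that leaves both children evenly mixed removes none.
mixed_child = ["immature"] * 2 + ["mature"] * 2
useless = impurity_decrease(parent, mixed_child, mixed_child)
```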
Objective 3: Proposal of New Geometric Features and Proof of Significance
(Note: see Addendum for data analysis revision)
Independent Variable: Proposed geometric feature
Dependent Variable: Average value of geometric feature across subtypes
Control Variables:
- Each new geometric feature will be calculated with the same quantitative formula for all the images.
- Each new feature will have the same number of data points in the feature matrix.
- The same segmentation algorithm will be used prior to calculation of new features.
- The same computer will be used for all calculations.
The proposed, new geometric features will be identified as being significant if the averages of the feature value in each class do not have overlapping 2SE ranges. Standard error is calculated as:
(standard deviation)/(square root of number of samples)
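The planned 2SE comparison can be sketched as below. The helper names and the feature values are hypothetical, and the sketch assumes the sample standard deviation (n - 1 in the denominator), which the plan does not specify.

```python
import math
import statistics


def two_se_interval(values):
    """Return (mean - 2*SE, mean + 2*SE) for one class's feature values."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation, i.e. statistics.stdev.
    se = statistics.stdev(values) / math.sqrt(len(values))
    return (mean - 2 * se, mean + 2 * se)


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical values of one feature for two classes of leukocytes.
class_a = [1.0, 2.0, 3.0, 4.0, 5.0]
class_b = [11.0, 12.0, 13.0, 14.0, 15.0]

interval_a = two_se_interval(class_a)
interval_b = two_se_interval(class_b)

# Under the plan's rule, non-overlapping 2SE ranges mark the feature
# as significant for discriminating between the two classes.
significant = not intervals_overlap(interval_a, interval_b)
```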
Types of Visualizations and Graphs:
A box plot will be graphed for the new geometric features to display the differences in the geometric features across the classes.
Objective 4: Output Suited for Clinical Setting
No data analysis is required for this objective.
Risk and Safety Evaluation:
There are no major hazards associated with the conducting of the project because all of the set up and experimentation will be done on a computer. Potential risks associated with using a computer are the common hazards of the internet. I will protect myself from these hazards by only downloading data and journal articles from secure sources, not communicating with strangers, and not revealing any personal information on the internet.
Addendum:
No changes were made to the procedure or data analysis for Objectives 1A and 1B (detection and classification of immature leukocytes). There were no changes to Objective 2 (identification of most important features for detection and classification). For Objective 3 (proposal of new geometric features and proof of significance), the data analysis method was revised. The two proposed features, average and standard deviation of nucleus color intensity in B Channel of LAB color space, were shown to be discriminative with Gini Importance instead of with standard error bars, because Gini Importance is the standard practice when using decision tree or Random Forest machine learning classifiers. A box plot was not necessary because a table with Gini Importance values sufficed to display which features were discriminative. Objective 4 was omitted from this study because it is not possible to assign a prognosis based on the morphology of one cell. It is worth noting that I am the lead author of a peer-reviewed research paper based on my project, which has been published in the journal Bioengineering and has been deposited in PubMed (link: https://doi.org/10.3390/bioengineering7040120).
Questions and Answers
Project Questions
1. What was the major objective of your project and what was your plan to achieve it?
The main objective of my project was to develop a machine learning model based on the cytomorphological features of leukocytes that is capable of detecting immature leukocytes, which are a sign of acute myeloid leukemia (AML), and classifying the immature leukocytes by myeloid cell type. A secondary objective was to calculate and rank the most important cytomorphological features of leukocytes for the detection and classification of immature leukocytes.
My plan to achieve the objective consisted of 4 main steps. First, during the processing and segmentation phase, I acquired images from a publicly available dataset in The Cancer Imaging Archive (Clark et al., 2013; Matek et al., 2019). I segmented the nucleus and cytoplasm of each leukocyte image through a combination of image format conversions, multi-Otsu thresholding, and image operations such as erosion and dilation. Second, during the feature extraction phase, I selected 12 shape features and 2 color features that are known to differentiate different types of leukocytes based on the properties of leukocyte maturation. I also proposed 2 new color features that I hypothesized would be discriminative based on the nature of leukocytes to appear more granular as they mature. I created algorithms to extract these 16 features from the isolated nucleus and cytoplasm of each leukocyte image. Third, during the machine learning phase, I trained, optimized, and evaluated a Random Forest model for binary classification between mature and immature leukocytes. I also trained, optimized, and evaluated a Random Forest model for multi-class classification of immature leukocytes into four types. Fourth, I calculated the Gini Importance of each feature in the Random Forest algorithms to determine and rank each cytomorphological feature’s importance.
a. Was that goal the result of any specific situation, experience, or problem you encountered?
I formed my goal for the project after learning about the limitations of the manual diagnosis of AML. The standard method of diagnosing AML involves a microscopic examination of the peripheral blood smear during which the clinician must classify and count each type of leukocyte. As a result, the manual diagnosis is labor-intensive and time-consuming (Prinyakupt & Pluempitiwiriyawej, 2015). Furthermore, manual examination is prone to biases such as the tiredness of the clinician and has a variation rate of 30-40% (Kazemi et al., 2016). Due to this, I believed an automatic, machine learning approach to diagnosing AML could address the shortcomings of the manual diagnosis.
b. Were you trying to solve a problem, answer a question, or test a hypothesis?
In addition to attempting to overcome the limitations of the standard diagnosis of AML with an automatic, machine learning approach, I was trying to test a hypothesis that answered a question. The main question was “what is the capability of a Random Forest classifier to distinguish between classes in an unbalanced dataset and distinguish between classes with inter and intra class variability”? My main question was primarily concerned with imbalance of data because the dataset I was using was unbalanced across classes of immature leukocytes (reflecting the data imbalance found in a clinical setting). My hypothesis that answered this question and was tested in my project was: “if a Random Forest algorithm (a type of machine learning algorithm) is employed for the detection and classification of AML cell subtypes, then the performance of the model, which will be calculated with standard metrics including accuracy, precision, and AUC-ROC, will be superior to previous approaches due to the nature of a Random Forest to not overfit to data and perform well with unbalanced data”. A secondary question I tried to answer in my project was “what morphological features are most discriminative for the detection and classification of immature leukocytes”?
2. What were the major tasks you had to perform in order to complete your project?
First, before any coding, I conducted an extensive literature review on computer-assisted diagnosis, bioinformatics, and machine learning before narrowing my focus to detecting and classifying AML due to the limitations in the manual diagnosis of the disease. I then read many journal articles about the molecular underpinnings of AML, biomedical image segmentation, feature extraction, and the Random Forest Algorithm. After writing up a research plan and getting it approved by my science teacher (who is my adult sponsor) and the mentors I was working with, I was ready to proceed with carrying out the project.
I coded the project in a Jupyter Notebook in the Python programming language. The first step was image processing and segmentation, during which I obtained images of leukocytes in peripheral blood smears from a publicly available dataset. I also processed the images and segmented them using image format conversion, multi-Otsu thresholding, and image operations such as erosion and dilation. The end product of the processing and segmentation phase was a segmented binary mask of the nucleus and the cytoplasm (essentially an isolated form of the nucleus and the cytoplasm without any color or noise) for each leukocyte. The second phase was feature extraction. After reading extensively about the biology and cytomorphology of leukocytes, I selected 12 shape features, 2 color features, and 2 novel proposed features to extract from each leukocyte to utilize during machine learning. I developed algorithms to extract each of the 16 cytomorphological features from each of the leukocytes to create a feature matrix that could be utilized for machine learning. The third phase was classification, in which I trained, optimized, and evaluated my machine learning models (utilizing the Random Forest algorithm) for detecting and classifying immature leukocytes. I trained the model with 80% of the data for detection of immature leukocytes (and reserved 20% of the data to test the model on instances it had not previously seen). For classification of immature leukocytes, I trained the model with 70% of the data and reserved 30% of the data for the test set. I optimized the models through a random search of hyperparameters and I selected the best parameters based on 5-fold cross validation accuracy score. I evaluated the models with accuracy, precision, recall, and area under the curve of the receiver operating characteristic curve. For the fourth and last phase of my project, I calculated the Gini Importance of each feature to determine and rank the importance of each feature. The nucleus to cytoplasm area ratio was found to be a discriminative feature for both detection and classification of immature leukocytes.
a. For teams, describe what each member worked on.
I am submitting this project to Mercer Science and Engineering Fair alone (not a team project). I completed the project, did all the coding, prepared the presentations, and wrote the paper entirely by myself with guidance from Marc Huo (a Stanford undergraduate majoring in Biomedical Computation) and Dr. Serena McCalla (the director of the summer program I did the project in, iResearch Institute). Marc Huo and Dr. Serena McCalla also assisted in the submission and revision of my paper during the publication process.
3. What is new or novel about your project?
a. Is there some aspect of your project's objective, or how you achieved it that you haven't done before?
One novel aspect of the project is its secondary objective: the calculation of feature importance (Gini Importance) of a variety of cytomorphological features for detection and classification of immature leukocytes. Previous studies lack a calculation of feature importance for a variety of both shape and color features for the diagnosis of AML.
The results of my project are also new, as the accuracy for detection of immature leukocytes (92.99%) is on par with the current state of the art, while the precision values for all classes during classification of immature leukocytes are above 65%, which is an improvement over the state of the art.
b. Is your project's objective, or the way you implemented it, different from anything you have seen?
In addition to the calculation of the feature importance for a variety of cytomorphological features and the high results, two of the color features proposed in my project are novel (nucleus average color intensity in B channel of LAB color space and nucleus standard deviation of color intensity in B channel of LAB color space). These features are supported by the biology of leukocytes because as leukocytes mature, they appear more granular (this effect is captured by the average and standard deviation of the nucleus’s color intensity). The two new proposed color features were shown to be discriminative based on a calculation of Gini Importance.
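A minimal sketch of how the two proposed features could be computed is shown below. It is not the project's code: it assumes 8-bit sRGB pixels, a D65 white point, and the population standard deviation, and the function names and pixel values are hypothetical. Only the B channel of the LAB conversion is implemented; a library routine such as `skimage.color.rgb2lab` would normally perform the full conversion.

```python
import statistics


def _srgb_to_linear(channel):
    """Undo the sRGB gamma curve for one 8-bit channel value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t):
    """Piecewise cube-root function used by the CIE LAB definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def lab_b(r, g, b):
    """Return the B (blue-yellow) channel of CIE LAB for an sRGB pixel."""
    rl, gl, bl = (_srgb_to_linear(v) for v in (r, g, b))
    # Linear sRGB to CIE XYZ (only Y and Z are needed for the B channel).
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    # Normalize by the D65 white point (Yn = 1.0, Zn = 1.08883).
    return 200.0 * (_lab_f(y / 1.0) - _lab_f(z / 1.08883))


def nucleus_b_features(pixels, nucleus_mask):
    """Mean and standard deviation of the B channel over nucleus pixels."""
    values = [lab_b(*rgb) for rgb, inside in zip(pixels, nucleus_mask) if inside]
    # Assumption: population standard deviation over the masked pixels.
    return statistics.mean(values), statistics.pstdev(values)


# Hypothetical flattened image: two purple nucleus pixels, one background pixel.
pixels = [(90, 60, 140), (100, 70, 150), (250, 250, 250)]
nucleus_mask = [True, True, False]
mean_b, std_b = nucleus_b_features(pixels, nucleus_mask)
```

Negative B values indicate a bluish color and positive values a yellowish one, so a purple-stained nucleus yields a negative mean.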
c. If you believe your work to be unique in some way, what research have you done to confirm that it is?
The three components of my project mentioned above (calculation of feature importance for a variety of cytomorphological features, results superior to the state-of-art, and two novel color features) are novel to the best of my knowledge. I have confirmed this through an extensive literature review of 21 papers (cited in the bibliography of my research plan) as I planned my project and the 41 papers I drew information from during my project (cited in my published research paper).
4. What was the most challenging part of completing your project?
The most challenging part of my project was developing robust, versatile image segmentation methods to isolate the nucleus and cytoplasm of each leukocyte image. Image segmentation proved to be difficult because of noise and small variations from image to image.
a. What problems did you encounter, and how did you overcome them?
During image segmentation, I encountered two main problems. First, in some images the leukocyte was surrounded by overlapping, unstained background cells. This made it difficult to create a robust algorithm that could isolate the centered leukocyte, which was the region of interest in the image. I was stuck on this problem for over a week because each time I modified the segmentation algorithm to work for a certain type of image, the algorithm would fail for other images. Eventually, I read many more papers on biomedical image segmentation, particularly for white blood cells, and learned about erosion and dilation (morphological image operations that can widen and close gaps between parts of the image). I implemented these operations, and they made the algorithm work for nearly all the images. Second, the segmentation algorithm failed for some images where staining obscured the cell. These images were removed from the next phase because they were unusable.
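The erosion and dilation operations mentioned above can be sketched on a binary mask as follows. This is a minimal NumPy illustration with a 3x3 square structuring element, not the segmentation pipeline used in the project. The function names and toy masks are hypothetical, and libraries such as SciPy and scikit-image offer ready-made versions of these operations.

```python
import numpy as np


def _neighborhood_stack(mask):
    """Stack the nine 3x3-shifted copies of a 2D boolean mask.

    Pixels outside the image are treated as background (False).
    """
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])


def erode(mask):
    """Keep a pixel only if its whole 3x3 neighborhood is foreground."""
    return _neighborhood_stack(mask).all(axis=0)


def dilate(mask):
    """Set a pixel if any pixel in its 3x3 neighborhood is foreground."""
    return _neighborhood_stack(mask).any(axis=0)


# Erosion then dilation ("opening") removes small specks such as stray pixels.
speckled = np.zeros((7, 7), dtype=bool)
speckled[2:5, 2:5] = True   # 3x3 blob standing in for the cell of interest
speckled[0, 6] = True       # isolated speck standing in for noise
opened = dilate(erode(speckled))

# Dilation then erosion ("closing") fills small gaps inside a region.
holed = np.zeros((7, 7), dtype=bool)
holed[2:5, 2:5] = True
holed[3, 3] = False         # one-pixel hole in the blob
closed = erode(dilate(holed))
```

Here `opened` keeps the 3x3 blob and drops the speck, while `closed` returns the blob with its hole filled.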
Throughout the project, there were numerous instances where I had to debug the code I’d written or reevaluate my approach towards a certain computational task. I overcame these bugs by thinking about the direct function performed by each line of code I had written and determining where problems existed. Debugging was made easier thanks to the detailed documentation (comments detailing what each line of code did) I had provided because I could easily figure out if a script wasn’t performing the intended task.
b. What did you learn from overcoming these problems?
As I overcame the difficulty of segmenting the nucleus and cytoplasm of each leukocyte, I learned the importance of creating versatile algorithms that work for all or most images in the dataset, not just for one type. I also learned that in scientific research, plans often do not work the way they are intended. For this reason, it is important to constantly refer back to journal papers and learn from previous methodologies when attempting to complete a difficult task. Especially since I was developing new methods in my project, I found that referring back to previous papers was crucial.
I also learned the importance of documenting work through detailed notes accompanying code because this makes debugging easier. Documentation also ensures work is reproducible and easily understood.
5. If you were going to do this project again, are there any things you would do differently the next time?
If I were to do the project again, one thing I would improve is feature engineering. While I selected 16 features that are known to be discriminative for the detection and classification of immature leukocytes based on the biology of leukocytes and properties of leukocyte maturation, I believe feature engineering after feature extraction could have significant benefits. For example, I could explore feature binning/grouping to determine which features of leukocyte cytomorphology work together to predict leukocyte type.
Another thing I could do differently is improve the optimization of my machine learning model. Although I used a grid search based on 5-fold cross-validation scores to tune the hyperparameters of the Random Forest algorithm, I could explore further optimization methods to see if they would improve the model’s performance for detection and classification of immature leukocytes.
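The grid search with 5-fold cross-validation described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual configuration: the data are synthetic stand-ins for the 16 extracted features, and the hyperparameter values in `param_grid` are assumptions, since the values searched in the project are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced two-class data with 16 features (assumption: stand-in only).
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=6,
    n_classes=2, weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search space; every combination is scored by 5-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,                # 5 folds (stratified by default for classifiers)
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_                   # combination with the best mean CV score
test_accuracy = search.score(X_test, y_test)        # accuracy on held-out data
importances = search.best_estimator_.feature_importances_  # impurity-based (Gini) importances
```

After fitting, `GridSearchCV` refits the best combination on the full training split, so `search.score` evaluates that refit model on data it has not seen.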
Finally, I chose the Random Forest algorithm because of its advantages over other machine learning algorithms for imbalanced data. If I were to do the project again, I could test other machine learning algorithms or other forms of machine intelligence (e.g., genetic/evolutionary algorithms) to see how they fare in comparison to the Random Forest algorithm I used.
6. Did working on this project give you any ideas for other projects?
Working on this project gave me numerous ideas for future projects, many of which I am currently trying out. First, during the feature extraction phase of my project, I noticed that some features had rare variants (e.g., the nucleus area was below a certain number for almost all leukocyte images, but above that number for very few leukocyte images). I wondered whether these rare variant features could be informative for prediction, given that their rarity means they would not have much of an effect on prediction when considered alone. I am currently interning at a lab at the Perelman School of Medicine at the University of Pennsylvania and developing a prototype of a rare-variant data binner that uses an evolutionary approach to group rare variant features. Grouping rare variant features could produce crucial information that isn’t typically available to machine learning algorithms that classify healthy and diseased patients.
While I was working on this project, I also thought of developing semi-automatic systems for image segmentation that can be completely integrated into a clinical setting. The algorithm could present different possibilities for the segmented nucleus and cytoplasm for each cell, then a clinician would confirm the correct segmentation and the model would give its prediction, which could aid the clinician in his or her final diagnostic decision. I hope to pursue this project further in the near future because of the potential of a machine learning system that can be integrated in a clinical setting to overcome the limitations of manually diagnosing AML and help many people.
7. How did COVID-19 affect the completion of your project?
The coronavirus pandemic shifted my project to a virtual format. Since the project primarily involved computer programming, this was not an issue.
Bibliography
Clark, K., Vendt, B., Smith, K., Freymann, J., Kirby, J., Koppel, P., Moore, S., Phillips, S., Maffitt, D., Pringle, M., Tarbox, L., & Prior, F. (2013). The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository. Journal of Digital Imaging, 26(6), 1045–1057. https://doi.org/10.1007/s10278-013-9622-7
Kazemi, F., Najafabadi, T., & Araabi, B. (2016). Automatic recognition of acute myelogenous leukemia in blood microscopic images using K-means clustering and support vector machine. Journal of Medical Signals & Sensors, 6(3), 183. https://doi.org/10.4103/2228-7477.186885
Matek, C., Schwarz, S., & Spiekermann, K. (2019). A Single-cell Morphological Dataset of Leukocytes from AML Patients and Non-malignant Controls. The Cancer Imaging Archive. https://doi.org/10.7937/tcia.2019.36f5o9ld
Prinyakupt, J., & Pluempitiwiriyawej, C. (2015). Segmentation of white blood cells and comparison of cell morphology by linear and naïve Bayes classifiers. BioMedical Engineering OnLine, 14(1), 63. https://doi.org/10.1186/s12938-015-0037-1