PneumoStack: A Novel Approach to Pneumonia and COVID-19 Diagnosis with Automated Chest X-ray Analysis via Stacked Generalization and Convolutional Neural Networks

Student: Sonya Jin
Table: MED1223
Experimentation location: Home
Regulated Research (Form 1c): No
Project continuation (Form 7): No


Pneumonia is the single largest infectious cause of death in children worldwide, accounting for 15% of all deaths of children under 5 years old. In the current pandemic, chest X-ray (CXR) analysis is also needed to rectify false negatives from RT-PCR in COVID-19 diagnosis, which underscores the need to improve diagnostic accuracy. As CXRs are the principal diagnostic tool for pneumonia, automated medical image classification can aid radiologists in expediting and improving the diagnostic process. Research in deep learning for medical image analysis has used individual transfer learning neural networks as well as neural network ensembles constructed by means such as bootstrap aggregation. This study presents a novel stacked model for CXR analysis composed of three CNN architectures: InceptionResNetV2, Xception, and ResNet50. All three pre-trained models were trained on a chest X-ray dataset for binary and multi-class classification and ensembled via stacked generalization into a neural network meta-learner. The proposed stacked model (PneumoStack) achieved an accuracy of 95.4% in three-category classification (COVID-19, non-COVID pneumonia, and normal) and 99.8% in binary classification (normal and pneumonia), outperforming each of its constituent classifiers and other models presented in current literature. By surpassing existing transfer learning models and ensembles, PneumoStack opens doors to higher performance in automated CXR analysis and other CNN applications in medicine.



  1. Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. Retrieved 31 January 2021, from
  2. He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. Retrieved 31 January 2021, from
  3. Chollet, F. (2016). Xception: Deep Learning with Depthwise Separable Convolutions. Retrieved 31 January 2021, from
  4. Bai, H., Hsieh, B., Xiong, Z., Halsey, K., Choi, J., & Tran, T. et al. (2020). Performance of Radiologists in Differentiating COVID-19 from Non-COVID-19 Viral Pneumonia at Chest CT. Radiology, 296(2), E46-E54. doi: 10.1148/radiol.2020200823
  5. Sharma, N., Jain, V., & Mishra, A. (2018). An Analysis Of Convolutional Neural Networks For Image Classification. Procedia Computer Science, 132, 377-384. doi: 10.1016/j.procs.2018.05.198
  6. Zhang, J. (1999). Developing robust non-linear models through bootstrap aggregated neural networks. Neurocomputing, 25(1-3), 93-113. doi: 10.1016/s0925-2312(99)00054-5
  7. Pasa, F., Golkov, V., Pfeiffer, F., Cremers, D., & Pfeiffer, D. (2019). Efficient Deep Network Architectures for Fast Chest X-Ray Tuberculosis Screening and Visualization. Scientific Reports, 9(1). doi: 10.1038/s41598-019-42557-4
  8. Nishio, M., Noguchi, S., Matsuo, H., & Murakami, T. (2020). Automatic classification between COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy on chest X-ray image: combination of data augmentation methods. Scientific Reports, 10(1). doi: 10.1038/s41598-020-74539-2
  9. Narin, A., Kaya, C., & Pamuk, Z. (2020). Automatic Detection of Coronavirus Disease (COVID-19) Using X-ray Images and Deep Convolutional Neural Networks. Retrieved 5 February 2021, from
  10. Pneumonia. (2021). Retrieved 1 January 2021, from
  11. Pneumonia. (2019). Retrieved 1 January 2021, from
  12. FastStats. (2021). Retrieved 1 January 2021, from
  13. Team, K. (2021). Keras documentation: Keras API reference. Retrieved 5 February 2021, from
  14. Siddiqi, R. (2021). Automated Pneumonia Diagnosis using a Customized Sequential Convolutional Neural Network | Proceedings of the 2019 3rd International Conference on Deep Learning Technologies. Retrieved 5 February 2021, from
  15. Deshmukh, Hardik. (2020). Medical X-ray Image Classification using Convolutional Neural Network. Retrieved 5 February 2021, from
  16. Osterburg, Stephen. “Implementation of Xception Model.” Coding, 
  17. Aguas, Kenneth. (2020). A guide to transfer learning with Keras using ResNet50. Retrieved 5 February 2021, from
  18. Brownlee, J. (2018). Stacking Ensemble for Deep Learning Neural Networks in Python. Retrieved 5 February 2021, from
  19. Izadyyazdanabadi, M., Belykh, E., Mooney, M., Martirosyan, N., Eschbacher, J., & Nakaji, P. et al. (2018). Convolutional neural networks: Ensemble modeling, fine-tuning and unsupervised semantic localization for neurosurgical CLE images. Journal Of Visual Communication And Image Representation, 54, 10-20. doi: 10.1016/j.jvcir.2018.04.004
  20. Guo, Y., Wang, X., Xiao, P., & Xu, X. (2019). An ensemble learning framework for convolutional neural network based on multiple classifiers. Soft Computing, 24(5), 3727-3735. doi: 10.1007/s00500-019-04141-w
  21. X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri and R. M. Summers. (2017). ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI. 3462-3471. doi: 10.1109/CVPR.2017.369
  22. Nishio, M., Noguchi, S., Matsuo, H., & Murakami, T. (2020). Automatic classification between COVID-19 pneumonia, non-COVID-19 pneumonia, and the healthy on chest X-ray image: combination of data augmentation methods. Scientific Reports, 10(1). doi: 10.1038/s41598-020-74539-2
  23. RSNA Pneumonia Detection Challenge (2018). (2021). Retrieved 6 February 2021, from


Research Plan:


  1. Rationale

Pneumonia is the single largest infectious cause of death in children worldwide (WHO). Early detection can reduce child mortality in the regions where pneumonia is most prevalent, South Asia and sub-Saharan Africa. Automating pneumonia diagnosis can bring accurate and efficient diagnostic tools to areas where trained radiologists are scarce. Furthermore, recent reports have revealed that RT-PCR has a sensitivity as low as 60%–71% for detecting COVID-19. These false-negative findings have the potential to strain the current supply of testing kits and other resources that could be better allocated to true-positive patients. Chest X-rays (CXRs) have demonstrated about 69% sensitivity in the detection of COVID-19 at initial presentation and can help rectify false-negative RT-PCR findings during the early stages of disease development, so it is imperative to improve CXR accuracy in the presence of COVID-19 pneumonia. Convolutional neural networks (CNNs) have achieved state-of-the-art results in a variety of image classification tasks. As CXRs are the principal diagnostic tool for pneumonia, automating medical image analysis can aid radiologists by making the diagnostic process faster and more accurate. Ensemble learning maximizes detection performance by combining the results of individual constituent algorithms. Transfer learning for image classification starts from the weights of source models pre-trained on millions of ImageNet images and fine-tunes them on the target task, which typically yields higher accuracy than training without pre-trained weights. Because these source models already recognize general features, transfer learning minimizes training time and computational cost while maximizing performance on the small CXR dataset analyzed.
Stacked generalization allows different transfer learning models to be ensembled, drawing on the strengths of each individual model architecture. By ensembling transfer learning CNNs through stacked generalization, a novel pneumonia classification algorithm can be developed for chest X-ray analysis, ideally outperforming current models and improving the accuracy and efficiency of pneumonia and COVID-19 diagnosis.

  2. Aim

  Present a stacked neural network meta-learner of transfer learning CNNs, built with stacked generalization, in an effort to achieve higher performance than any one of its constituent classifiers and existing individual models in binary and multiclass pneumonia CXR classification.

  3. Research Questions
  1. What are the optimal CNN models to ensemble to maximize accuracy in the applied image classification algorithm? 
  2. How can automating pneumonia diagnosis with machine learning enhance or outperform current diagnosis methods used?
  3. Does this proposed stacked model outperform other models presented in current literature for this task of medical imaging analysis?
  4. Why convolutional neural networks as opposed to other neural networks and algorithms for this predictive modeling problem? Are convolutional neural networks the best individual classifier algorithm to use?
  5. How can performance be maximized using this method (adjustment of hyperparameters, augmentation methods, etc.)?
  4. Procedures
  1. Data pre-processing:
    1. Dataset selection: multiple sources of data will be selected from the NIH Clinical Center Chest X-Ray database and combined into a large dataset for training. Data will be cleaned by removing non-viral and non-bacterial pneumonia from the data (non-tertiary pneumonia), and rearranging class labels into a single column for direct class stratification.
    2. Data importation: a .csv file will be created with paths and labels for every CXR scan and will be imported into Google Colaboratory.
    3. Required packages and libraries for model implementation and data visualization will be imported into Google Colaboratory: scikit-learn, Keras, TensorFlow, pandas, NumPy, Matplotlib, and Seaborn.
    4. Train-test split and data stratification: the data will be split into training and test sets via the scikit-learn train-test split procedure and stratified according to class label.
    5. RGB value extraction: A function will be written to iterate through the path of every image and extract its RGB array. 
    6. The ImageDataGenerator augmentation method will be used to augment the data with the keras and keras-preprocessing libraries. 
      1. Two generators will be built with basic color normalizing augmentation using keras-preprocessing for training and testing data, respectively.
    7. Model Evaluation:
      1. Compute the F1 score, ROC-AUC, accuracy, and confusion matrix values with the roc_curve, roc_auc_score, precision_recall_curve, f1_score, and auc functions from the sklearn.metrics module.
      2. For three-category classification, compute accuracy and confusion matrix values with sklearn.metrics. 
      3. Validation accuracy and loss visualization: plot the AUC and ROC curves, change in FPR/TPR, and confusion matrix with Matplotlib. 
  2. Constituent model construction, training, and evaluation
    1. Xception model:
      1. Write a function to instantiate and implement Xception architecture with depthwise separable convolution blocks. Create model with Keras Models API.
      2. Refer to Training of Sequential CNN (2.2.1-2.2.5) for training the Xception model. For 2.2.3, load weights from ImageNet with the Keras Xception function.
      3. Fine-tune the model with layer freezing procedure.
      4. Perform model evaluation with performance metric computation and visualization. Refer to the Model Evaluation step.
    2. ResNet50 model:
      1. Repeat the four Xception steps with the ResNet50 architecture. When loading weights, load the respective ImageNet weights for ResNet50 with the Keras ResNet50 function.
    3. InceptionResNetV2 model:
      1. Repeat the four Xception steps with the InceptionResNetV2 architecture. When loading weights, load the respective ImageNet weights for InceptionResNetV2 with the Keras InceptionResNetV2 function.
  3. Stacked Generalization Ensemble construction, training, evaluation
    1. Load the three saved transfer learning models with Keras and append to list.
    2. Define a function to return a stacked generalization neural network model with model list as input.
      1. Freeze layers in all transfer models with iteration.
      2. Merge outputs of models with concatenation merge.
      3. Define a hidden and output Dense layer with the Keras Dense class.
    3. Train ensemble by fitting the model based on the outputs from constituent models.
    4. Evaluate the model on the test set. Refer to the Model Evaluation step.
    5. Repeat with three-category classification by changing the output layers to softmax and updating the number of inputs to the Flatten layer. Alter the confusion matrix code to support multiclass classification.
  4. Model Alteration
    1. Modification of data augmentation methods: Data augmentation will be altered by adjusting the variables in ImageDataGenerator until performance is maximized. Model will be run with altered basic color normalizing augmentation and re-run with complex augmentation (zoom, rotate, crop). Proceed with augmentation method and variables with best model performance metrics.
    2. Adjust hyperparameters in the constituent models. Re-run and re-evaluate constituent models and meta-learner to analyze improvement in performance. 
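
The stacked generalization ensemble described in the procedures (freeze the constituent models, merge their outputs by concatenation, add a hidden and an output Dense layer) can be sketched with the Keras functional API. This is a minimal sketch, not the project's actual code. The function name build_stacked_model, the single shared image input, the hidden layer width, and the optimizer are assumptions made for illustration, and the sketch assumes every constituent model accepts the same input shape and already outputs class probabilities.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_stacked_model(members, input_shape, n_classes, hidden_units=16):
    """Join frozen member models into one network with a trainable dense head.

    Hypothetical helper: the name, arguments, and defaults are illustrative.
    """
    # Freeze every constituent model so that only the meta-learner is trained.
    for member in members:
        member.trainable = False

    # Assumption: all members share one image input of the same shape.
    image = keras.Input(shape=input_shape)
    member_outputs = [member(image, training=False) for member in members]

    # Concatenation merge of the members' class-probability outputs.
    merged = layers.Concatenate()(member_outputs)

    # Meta-learner: one hidden Dense layer and one softmax output layer.
    hidden = layers.Dense(hidden_units, activation="relu")(merged)
    output = layers.Dense(n_classes, activation="softmax")(hidden)

    model = keras.Model(inputs=image, outputs=output)
    # categorical_crossentropy expects one-hot encoded labels.
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

With the three saved transfer learning models loaded into a list, the returned model would be fitted on the training images so that only the two Dense layers are updated. Keras requires the member models to have distinct names, so models loaded from disk under the same name may need renaming first.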


Questions and Answers


1. What was the major objective of your project and what was your plan to achieve it?

The main objective is to present a model that outperforms existing models in pneumonia and COVID-19 diagnosis via chest X-rays (CXRs) in binary (pneumonia vs. healthy) and ternary (healthy, COVID-19, tertiary pneumonia) classification. My plan involved using an ensembling method, stacked generalization, together with transfer learning to achieve higher performance than models presented in current research on automated chest X-ray analysis for COVID-19 and pneumonia detection. I planned to ensemble three transfer learning models that were picked based on their performance in relevant studies as well as on the training data: Xception, InceptionResNetV2, and ResNet50. 

    a. Was that goal the result of any specific situation, experience, or problem you encountered?  

This goal arose from my experience of the current COVID-19 pandemic. The need to contain infection and to accurately diagnose individuals to prevent the spread of the disease is more pressing than ever. 

    b. Were you trying to solve a problem, answer a question, or test a hypothesis?

  I was trying to both solve problems and test a hypothesis. 

One problem is that reverse-transcription polymerase chain reaction (RT-PCR), the gold standard diagnostic test for COVID-19, has a sensitivity as low as 60%–71%, while CXRs have a sensitivity of about 69%. CXR analysis can therefore help rectify false-negative RT-PCR findings in COVID-19 diagnosis, which emphasizes the need to increase CXR diagnostic accuracy for COVID-19. Additionally, in a study of chest CT scans (n = 424), the accuracy of two radiologists from China in differentiating COVID-19 from non-COVID-19 viral pneumonia was 80% (338 of 424) and 60% (255 of 424), emphasizing the need for AI applications in medical image analysis to minimize human error.

Another problem is the need to increase pneumonia CXR diagnostic accuracy. By automating pneumonia CXR analysis, diagnostic tools can be brought to locations where pneumonia is prevalent and where trained radiologists are limited or not always available, such as sub-Saharan Africa and South Asia. Pneumonia remains the leading infectious cause of death for children under 5 years old worldwide, concentrated in areas without access to trained radiologists.

Further, the hypothesis tested in this study was that a convolutional neural network (CNN) meta-learner constructed with stacked generalization from multiple optimal CNN architectures would achieve higher accuracy than previously proposed methods, paving the way for higher performance in automated CXR analysis.

2. What were the major tasks you had to perform in order to complete your project?

    First, I did the data preprocessing, which included splitting the data into training and testing sets, class stratification, data cleaning, and augmentation with scikit-learn. Then, I constructed three transfer learning models with Keras (Xception, InceptionResNetV2, and ResNet50, picked based on their performance on the training data and in past studies of similar tasks), loaded weights pre-trained on ImageNet, and fine-tuned each model with the layer freezing procedure. Next came model training: I trained each model on the training data with callbacks to prevent overfitting. Then, I constructed a CNN meta-learner and a stacked dataset from the outputs of the trained models, and trained the stacked model on the stacked dataset. Finally, I evaluated the models using binary classification performance metrics (F1, ROC-AUC, accuracy) and three-category classification metrics (precision, recall, accuracy), and visualized the results.
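
A stacked dataset of this kind can be pictured as one feature row per image, formed by joining the class probabilities that each trained model predicts for that image. The following is a minimal sketch in plain Python. The function name build_stacked_dataset and the probability values are hypothetical and are not taken from the project.

```python
def build_stacked_dataset(member_predictions):
    """Concatenate per-model class probabilities into one row per image.

    member_predictions holds one entry per base model. Each entry is a list
    of per-image probability vectors, with every model scoring the same
    images in the same order.
    """
    n_images = len(member_predictions[0])
    if any(len(preds) != n_images for preds in member_predictions):
        raise ValueError("every model must score the same number of images")

    stacked = []
    for i in range(n_images):
        row = []
        for preds in member_predictions:
            row.extend(preds[i])  # append this model's probabilities
        stacked.append(row)
    return stacked


# Hypothetical probabilities for two images from three base models,
# in the class order (COVID-19, non-COVID pneumonia, normal).
xception = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
inception_resnet = [[0.6, 0.3, 0.1], [0.2, 0.1, 0.7]]
resnet50 = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]

stacked = build_stacked_dataset([xception, inception_resnet, resnet50])
print(len(stacked), len(stacked[0]))  # 2 9
```

Each row has one feature per model per class, so three models and three classes give nine inputs to the meta-learner.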

3. What is new or novel about your project?

My approach is novel in the sense that it uses the state-of-the-art image classifier algorithm, the convolutional neural network (CNN), as the meta-learner through stacked generalization and presents a new ensemble of CNN architectures. My approach, in contrast to other proposed methods, uses both ensemble and transfer learning, reaping the benefits of multiple transfer learning model architectures and leading to higher performance. 

    a. Is there some aspect of your project's objective, or how you achieved it that you haven't done before?

I’ve actually never worked with transfer learning models and ensembling techniques. I’ve also never worked with convolutional neural networks in general, including construction, training, and evaluation.

    b. Is your project's objective, or the way you implemented it, different from anything you have seen?

To my knowledge, my novel ensemble, consisting of Xception, InceptionResNetV2, and ResNet50, has never been implemented before in any CNN application. Further, existing models used for this task of pneumonia and COVID-19 detection in chest X-ray analysis are individual transfer learning models or ensembles constructed by other means such as bootstrap aggregation, boosting, soft voting, or hard voting. Where stacked generalization is used, the meta-learner is constructed via logistic regression or SVM, not a neural network. My meta-learner is a sequential CNN.

    c. If you believe your work to be unique in some way, what research have you done to confirm that it is?

CNNs have been at the forefront of automated CXR analysis research in an effort to detect pneumonia and COVID-19, but these models have been limited to individual transfer learning source models and ensembles constructed via bootstrap aggregation and voting. Wang et al. [16] constructed COVID-Net, a tailored CNN for the detection of COVID-19 with a projection-expansion-projection-extension (PEPX) design pattern. With three classes (non-COVID pneumonia, COVID-19, normal), the model achieved an accuracy of 93.3%. Apostolopoulos et al. [17] used VGG-19 as a base model for three classes and achieved an accuracy of 87%. Umer et al. [18] proposed COVINet, a CNN approach with three convolutional layers, a max pooling layer, an average pooling layer, and four FC layers. COVINet achieved an accuracy of 89.9% with three classes. Nishio et al. [19] used VGG-16 for the detection of three classes and achieved an accuracy of 83.68% with a combination of conventional and mixup data augmentation methods. For binary classification, many approaches were proposed, such as the MADE-based CNN [20] with 92.55% accuracy, Deep CNN [23] with 93% accuracy, and a weighted voting ensemble [33] with 72.26% accuracy.

*Comparison chart found in presentation files

4. What was the most challenging part of completing your project?

The most challenging part was learning all of the frameworks and languages needed for this project. This was my first research project involving machine learning and computer science, and I only learned Python this summer. A substantial share of the ~500 hours spent on this project went to reading literature, documentation, and tutorials. I frequently pulled all-nighters to meet the deadline because I was so new to this field and needed to learn advanced techniques to implement in this project. Further, since I was working with a novel ensemble, it was difficult to find literature and tutorials on stacked generalization with a neural network meta-learner and transfer learning base models. As a result, I could not find direct answers to my errors online, and I had to build up the relevant knowledge to devise my own solutions.

   a. What problems did you encounter, and how did you overcome them?

The dataset I used had many labels, and I needed a specific number of subclasses. The .csv was not labeled in a way for clear stratification without modifying the dataset. I had to learn Pandas for data-preprocessing, clean the data with iteration, and construct a function to stratify the class labels. Additionally, I kept getting errors when implementing stacked generalization. I frequently stared at one error for over five hours without knowing how to fix it, as the error was not on Stack Overflow (with a relevant task). I had to read several cases of the error in irrelevant tasks to achieve an understanding of what might be wrong, sift through my own code, and experiment by adjusting hyperparameters, debugging by printing inputs and outputs, etc. These errors occurred at the end of my code at the definition and training of the stacked model, after the training of the individual transfer learning models, so I had to solve the error immediately or else I would disconnect from the server and re-train (a 5-hour process). This process was very time-consuming and demanding. Frequently, I stayed up until 5 am or later debugging because I did not have the time to re-train all of the models the next day, only to wake up at 7 am for class. 
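
Once every scan carries a single class label, a stratified split keeps the class proportions the same in the training and test sets. The idea can be illustrated without any libraries. This is a sketch under assumptions, not the project's code: the function name stratified_split, the 20% test fraction, and the label counts are hypothetical. In practice, scikit-learn's train_test_split with its stratify argument does the same job.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) that preserve class proportions."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical, deliberately imbalanced label column.
labels = ["normal"] * 10 + ["pneumonia"] * 20 + ["covid"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 28 7
```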

   b. What did you learn from overcoming these problems?

Ultimately, I relied on my resilience and grit to overcome these problems. After being faced with multiple errors that I stared at the day before for multiple hours, I woke up the next day, re-trained, and worked on them again. I learned to have patience, to not give up, and to have faith that my hard work will be worth it in the end. I now also know the fundamentals of machine learning from my extensive literature and documentation review, despite starting this project with absolutely no machine learning knowledge.

5. If you were going to do this project again, are there any things you would do differently the next time?

The dataset was heavily imbalanced. If I had more time, I would have used the synthetic minority oversampling technique (SMOTE) to balance the dataset first by synthetically generating minority-class samples, and then split the training and testing data. With SMOTE, I may have reached even higher performance in three-category classification. Furthermore, I would have replaced ResNet50 with ResNet101, a deeper variant that could improve constituent model performance and, in turn, stacked model performance. ResNet50 was used in this study to minimize computational training time and expense.
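
The core idea of SMOTE is to create each synthetic minority sample by interpolating between a real minority sample and one of its nearest minority-class neighbours. The sketch below shows that idea in plain Python on small feature vectors. It is an illustration only: the function name smote_sketch, its parameters, and the toy points are hypothetical, and for images the vectors would have to be flattened pixels or extracted features. A full implementation is available in the imbalanced-learn library.

```python
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]

        # Rank the other minority samples by distance and keep the k nearest.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: squared_distance(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]

        # Place the new sample at a random point on the connecting segment.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(corners, n_synthetic=10, k=2, seed=1)
print(len(new_points))  # 10
```

Oversampling is often applied to the training portion only, so that synthetic points derived from test images cannot appear in training.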

6. Did working on this project give you any ideas for other projects? 

To investigate whether this performance advantage over other proposed methods in automated CXR analysis carries over to other applied CNN tasks in medicine, PneumoStack may be applied to tasks such as MRI analysis for the early detection of neurodegenerative disease, differential gene analysis, and biomarker identification.

7. How did COVID-19 affect the completion of your project?

COVID-19 actually inspired my project to take a turn. Originally, I worked on pneumonia diagnosis as there was extensive published literature on automated pneumonia diagnosis via CXR analysis. However, after coming across a dataset with COVID-19 CXRs, I included COVID-19 diagnosis in my project.