This project evaluates the effectiveness of transformer-based models—Vision Transformer (ViT), Swin Transformer, and BEiT—for automated diagnosis using chest X-ray images from the CheXpert dataset.
We systematically compare these architectures by assessing accuracy, loss, and interpretability through attention map visualization. Our experiments reveal that BEiT achieves superior generalization, consistently outperforming ViT and Swin on unseen data. Further analysis demonstrates that models trained on a substantially smaller dataset (11 GB versus over 400 GB) retain comparable diagnostic accuracy, substantially reducing storage and computational requirements. Attention visualization confirms that BEiT localizes clinically relevant regions more effectively, enhancing interpretability and clinical trustworthiness.
These results position BEiT as a robust and resource-efficient architecture for medical image analysis tasks, highlighting the importance of comprehensive benchmarking in model selection for clinical applications.
We used the CheXpert-v1.0-small dataset, a downsized version of the original CheXpert dataset. It contains 224,316 chest radiographs from 65,240 patients, labeled with 14 clinical observations.
The labels were generated using an automated labeling system capable of detecting and classifying findings, including those with inherent uncertainty.
To ensure labeling reliability, a validation set of 200 studies was manually annotated by three board-certified radiologists.
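For concreteness, the snippet below shows one common way to load the label file and resolve uncertain findings before training. It assumes the train.csv layout shipped with CheXpert-v1.0-small (metadata columns followed by 14 observation columns valued 1.0, 0.0, -1.0, or blank); the column index and the "treat uncertain as positive" policy are assumptions for illustration, and other mappings are equally reasonable.

```python
import pandas as pd

# Placeholder path; CheXpert-v1.0-small ships train.csv / valid.csv with one
# row per image. Observation columns hold 1.0 (positive), 0.0 (negative),
# -1.0 (uncertain), or blank (not mentioned).
df = pd.read_csv("CheXpert-v1.0-small/train.csv")

# Assumption: the 14 observation columns follow the metadata columns
# (Path, Sex, Age, Frontal/Lateral, AP/PA); adjust the index if the layout differs.
label_cols = df.columns[5:]
labels = df[label_cols].copy()

# One common policy ("U-Ones"): treat uncertain findings as positive;
# blanks are treated as negative.
labels = labels.replace(-1.0, 1.0).fillna(0.0)
df[label_cols] = labels
```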
The Vision Transformer (ViT) applies the standard Transformer architecture directly to image patches, treating them as sequences similar to words in natural language. It splits the input image into fixed-size patches and processes them with self-attention mechanisms. ViT is known for its simplicity and scalability, performing well with large datasets but requiring significant data and compute resources to outperform CNN-based models.
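As a rough sketch of this patch-based input pipeline, the module below splits a 224x224 image into 16x16 patches, projects each patch to an embedding, and prepends a learnable [CLS] token plus positional embeddings. It is a simplified illustration of the general ViT design, not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                            # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)               # prepend the [CLS] token
        return x + self.pos_embed                    # add learned positions
```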
BEiT builds upon ViT by introducing a pretraining strategy similar to BERT in NLP. It treats image patches as discrete visual tokens and learns bidirectional representations using a masked image modeling objective. This enables the model to better capture contextual relationships within the image, significantly improving performance in downstream tasks, especially when labeled data is limited.
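The sketch below illustrates the masking step at the heart of masked image modeling: a fraction of patch embeddings is replaced with a learnable mask token, and the pretraining objective is then to predict the discrete visual token at each masked position. For brevity it uses uniform random masking, whereas BEiT itself uses a blockwise masking strategy; all names here are illustrative.

```python
import torch

def mask_patches(patch_embeddings, mask_token, mask_ratio=0.4):
    """Replace a random subset of patch embeddings with a learnable mask token.

    `mask_token` is a learnable parameter of shape (1, 1, D). During
    BEiT-style pretraining the model predicts the discrete visual token
    (from a pretrained image tokenizer) at each masked position; this
    sketch only shows the masking step.
    """
    B, N, D = patch_embeddings.shape
    num_masked = int(N * mask_ratio)

    # Random permutation per sample; the first `num_masked` indices are masked.
    noise = torch.rand(B, N, device=patch_embeddings.device)
    ids = noise.argsort(dim=1)
    mask = torch.zeros(B, N, dtype=torch.bool, device=patch_embeddings.device)
    mask.scatter_(1, ids[:, :num_masked], True)

    masked = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patch_embeddings)
    return masked, mask   # `mask` marks positions whose visual tokens must be predicted
```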
The Swin Transformer introduces a hierarchical architecture that processes images through non-overlapping local windows with shifted configurations across layers. This design enables both local and global representation learning while maintaining computational efficiency. Its ability to model long-range dependencies and multi-scale features makes it particularly effective for dense prediction tasks such as detection and segmentation.
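The core operation behind Swin's local attention is window partitioning, sketched below under assumed stage-1 dimensions (a 56x56 feature map with 96 channels and 7x7 windows). Self-attention is computed independently inside each window, and alternating layers cyclically shift the feature map so that information can cross window boundaries.

```python
import torch

def window_partition(x, window_size=7):
    """Partition a feature map into non-overlapping windows.

    x: (B, H, W, C) with H and W divisible by window_size.
    Returns (num_windows * B, window_size, window_size, C); attention is
    then computed independently inside each window.
    """
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

# In alternating layers the feature map is cyclically shifted before
# partitioning so information can flow across window boundaries.
x = torch.randn(2, 56, 56, 96)                          # assumed stage-1 feature map
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))   # shift by window_size // 2
windows = window_partition(shifted)                     # (2 * 64, 7, 7, 96)
```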
The graph presents a comparative evaluation of three transformer-based models: Vision Transformer (ViT), Swin Transformer, and BEiT. Performance is tracked over 10 training epochs using accuracy and loss metrics on both training and test sets.
Overall, BEiT demonstrates the best balance between learning and generalization, making it the most robust among the three models in this evaluation.
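For reference, a minimal sketch of the per-epoch routine behind such curves is shown below: a multi-label binary cross-entropy loss over 14 logits and a thresholded per-label accuracy. The function signature, threshold, and helper name are illustrative assumptions rather than the exact training code used in this project.

```python
import torch
import torch.nn as nn

def run_epoch(model, loader, optimizer=None, device="cuda", threshold=0.5):
    """One pass over the data; returns mean BCE loss and per-label accuracy.

    Assumes `model` maps a batch of images to 14 logits (one per observation)
    and that `loader` yields (images, multi-hot labels) pairs.
    """
    criterion = nn.BCEWithLogitsLoss()
    training = optimizer is not None
    model.train(training)
    total_loss, correct, count = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device).float()
        with torch.set_grad_enabled(training):
            logits = model(images)
            loss = criterion(logits, labels)
        if training:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        total_loss += loss.item() * labels.numel()
        correct += ((logits.sigmoid() > threshold) == labels.bool()).sum().item()
        count += labels.numel()
    return total_loss / count, correct / count
```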
The comparison between the full and downsampled datasets shows that model performance remains nearly unchanged, even when trained on significantly less data.
Despite the full dataset being over 400 GB and the downsampled version only 11 GB, accuracy and loss metrics are almost identical.
This finding highlights the potential for substantial resource savings, in both storage and computational cost, without sacrificing model effectiveness.
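As an illustration of where the 400 GB to 11 GB reduction comes from, the sketch below downscales full-resolution radiographs into a smaller copy. The directory names and the 320-pixel target size are assumptions made for the example; the official CheXpert-v1.0-small release was produced by the dataset authors.

```python
from pathlib import Path
from PIL import Image

SRC = Path("CheXpert-v1.0")          # full-resolution images (~400 GB); placeholder path
DST = Path("CheXpert-v1.0-small")    # downscaled copy (~11 GB); placeholder path

for src_path in SRC.rglob("*.jpg"):
    img = Image.open(src_path).convert("L")   # radiographs are grayscale
    img.thumbnail((320, 320))                 # shrink the longer side to ~320 px (assumed target)
    dst_path = DST / src_path.relative_to(SRC)
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    img.save(dst_path, quality=90)
```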
The CLS token, designed to aggregate global information across the image, can be leveraged to create 2D attention heatmaps by visualizing how attention is distributed across patches.
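A minimal sketch of this procedure using a generic ViT checkpoint from the Hugging Face transformers library is shown below; the checkpoint name and image path are placeholders rather than the fine-tuned models evaluated in this project.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Placeholder checkpoint and image; any ViT checkpoint that exposes attentions works.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k",
                                 output_attentions=True)

image = Image.open("example_cxr.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Last-layer attention: (batch, heads, tokens, tokens); token 0 is [CLS].
attn = outputs.attentions[-1].mean(dim=1)   # average over heads
cls_to_patches = attn[0, 0, 1:]             # attention from [CLS] to the 196 patch tokens
heatmap = cls_to_patches.reshape(14, 14)    # 224 / 16 = 14 patches per side
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
# `heatmap` can now be upsampled to 224x224 and overlaid on the input image.
```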
The heatmaps show that as the model improves, its attention becomes more focused on disease-relevant regions.
Initially, the model fails to localize lesions accurately, but after training updates, it correctly highlights critical areas, demonstrating improved interpretability and diagnostic alignment.
These GIFs illustrate how attention heads from two different transformer architectures—Vision Transformer (ViT) and BEiT—distribute focus across chest X-ray images.
Overall, these visualizations highlight BEiT’s more effective attention allocation, resulting in greater interpretability and reliability in medical imaging tasks compared to ViT.
We evaluated three transformer models (ViT, Swin Transformer, and BEiT) on the CheXpert dataset, using both the full and reduced versions to assess their performance.
Among them, BEiT demonstrated the most robust results, attributed to its use of masked image modeling during pre-training, which enhances generalization and the ability to capture diverse image features.
For future work, we plan to conduct robustness tests by introducing small Gaussian noise or perturbations to input images, measuring performance degradation, and applying additional training to improve model resilience if necessary.
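A simple version of such a perturbation test could look like the sketch below, which adds zero-mean Gaussian noise of configurable strength to normalized input images and reuses the evaluation routine sketched earlier; the noise level and helper names are illustrative assumptions.

```python
import torch

def add_gaussian_noise(images, std=0.05):
    """Return a perturbed copy of a batch of (already normalized) images.

    `std` controls perturbation strength; sweeping it and re-running the
    evaluation measures how quickly accuracy degrades.
    """
    noisy = images + torch.randn_like(images) * std
    return noisy.clamp(images.min().item(), images.max().item())  # stay in the original range

# Example (hypothetical helpers): compare clean vs. perturbed accuracy.
# clean_loss, clean_acc = run_epoch(model, test_loader)
# noisy_loader = ((add_gaussian_noise(x), y) for x, y in test_loader)
# noisy_loss, noisy_acc = run_epoch(model, noisy_loader)
```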
Q1: Why did the ViT model initially show signs of overfitting, and how was this addressed?
A1: Initially, the ViT model exhibited signs of overfitting when trained to predict only 4 labels. The limited binary classification task introduced significant randomness, causing the model to rely on guesswork rather than learning genuine patterns.
After retraining the model to predict all 14 labels, which enriched the complexity and diversity of the training signal, overfitting was substantially reduced. The more comprehensive labeling encouraged the model to learn robust features instead of memorizing limited patterns, resolving the previously observed issues.
Q2: Can you provide insights into why BEiT achieved superior performance compared to ViT and Swin Transformers?
A2: BEiT’s superior performance likely stems from its effective pre-training strategy, specifically masked image modeling, which encourages the model to learn generalized representations from image data.
Attention visualizations confirmed BEiT’s ability to consistently focus on clinically relevant regions, indicating enhanced internal representations and better decision-making capabilities for medical diagnosis tasks.
Q3: Can you provide more details on the benchmark comparison of the three transformer models?
A3: We conducted a comprehensive benchmark comparison of three transformer-based models—ViT, Swin, and BEiT—evaluating their accuracy, loss metrics, and interpretability through attention visualizations.
A detailed analysis highlighting each model's strengths and weaknesses can be found here.