Multimodal Medical Image Classification using CLIP and ResNet
One-line summary: Explored multimodal (CLIP) and image-only (ResNet-50) approaches to diabetic retinopathy classification, weighing semantic alignment against fine-grained visual discrimination.
Key Results
- ResNet-50 achieved 92.52% accuracy.
- Two-stage CLIP approach achieved 89.31% accuracy.
- Used Grad-CAM to analyze visual explanations and t-SNE to inspect embedding structure.
What I Built
- Multimodal CLIP-based classification pipeline.
- Two-stage training flow: multimodal pretraining + image-only classifier.
- Grad-CAM and t-SNE analysis stack for interpretability.
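The two-stage flow above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: tiny stand-in encoders and random tensors replace the real CLIP towers and retinal images, and the class-prompt embedding table is an assumed simplification of CLIP's text encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in encoders (the real pipeline used CLIP's image and text towers).
img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
txt_enc = nn.Embedding(5, 64)  # one "prompt" embedding per severity grade (assumption)

images = torch.randn(16, 3, 32, 32)   # dummy stand-ins for retinal images
labels = torch.randint(0, 5, (16,))   # dummy severity grades

# --- Stage 1: multimodal contrastive alignment (CLIP-style pretraining) ---
opt = torch.optim.Adam(
    list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-3
)
for _ in range(50):
    z_img = F.normalize(img_enc(images), dim=-1)
    z_txt = F.normalize(txt_enc(labels), dim=-1)
    logits = z_img @ z_txt.t() / 0.07  # temperature-scaled cosine similarities
    targets = torch.arange(len(images))
    # Symmetric InfoNCE over image->text and text->image directions;
    # duplicate-label batches make this noisy, which is fine for a sketch.
    loss = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: freeze the image encoder, train an image-only classifier ---
for p in img_enc.parameters():
    p.requires_grad_(False)
clf = nn.Linear(64, 5)
opt2 = torch.optim.Adam(clf.parameters(), lr=1e-2)
for _ in range(100):
    loss = F.cross_entropy(clf(img_enc(images)), labels)
    opt2.zero_grad(); loss.backward(); opt2.step()

acc = (clf(img_enc(images)).argmax(1) == labels).float().mean().item()
print(f"train accuracy on dummy data: {acc:.2f}")
```

The key design point is that stage 2 trains only the classification head, so the classifier inherits whatever semantic structure stage 1 baked into the image embeddings.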
Technical Approach
- Compared representation quality and downstream performance between CLIP and ResNet approaches.
- Evaluated visual localization behavior using Grad-CAM maps.
- Examined embedding separability and class structure using t-SNE projections.
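The Grad-CAM evaluation can be sketched as follows. This is a hedged illustration on a tiny stand-in CNN rather than ResNet-50; the layer choice, hook names, and dummy input are all assumptions, but the weighting scheme (channel weights from spatially averaged gradients, ReLU over the weighted sum of feature maps) is standard Grad-CAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny stand-in CNN (the project used ResNet-50; this architecture is assumed).
conv = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
head = nn.Linear(16, 5)

# Hooks capture the last conv layer's activations and their gradients.
feats, grads = {}, {}
conv[2].register_forward_hook(lambda m, i, o: feats.update(a=o))
conv[2].register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 32, 32)  # dummy stand-in for a fundus image
logits = head(F.adaptive_avg_pool2d(conv(x), 1).flatten(1))
cls = logits.argmax(1).item()
logits[0, cls].backward()      # backprop the predicted class score

# Grad-CAM: channel weights = spatial mean of gradients,
# heatmap = ReLU of the weighted sum over channels.
w = grads["a"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((w * feats["a"]).sum(dim=1))
cam = cam / (cam.max() + 1e-8)  # normalize to [0, 1]
print(cam.shape)                 # heatmap at feature-map resolution
```

In the actual analysis, such a heatmap would be upsampled to the input resolution and overlaid on the retinal image to check whether the model attends to lesion regions.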
Key Insight
CLIP pretraining improved semantic structure in the embedding space, but the image-only ResNet-50 performed better (92.52% vs. 89.31%) because diabetic retinopathy grading hinges on fine-grained, localized visual detail.
Tools / Models Used
Python, PyTorch, CLIP, ResNet-50, Grad-CAM, t-SNE.