Multimodal Medical Image Classification using CLIP and ResNet
One-line summary: Explored multimodal (CLIP) and image-only (ResNet-50) approaches to diabetic retinopathy classification, weighing semantic alignment against fine-grained visual discrimination.
Key Results
- ResNet-50 achieved 92.52% accuracy.
- Two-stage CLIP approach achieved 89.31% accuracy.
- Used Grad-CAM to analyze visual explanations and t-SNE to inspect embedding structure.
What I Built
- Multimodal CLIP-based classification pipeline.
- Two-stage training flow: multimodal pretraining + image-only classifier.
- Grad-CAM and t-SNE analysis stack for interpretability.
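The two-stage flow above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: tiny stand-in encoders and random tensors replace the real CLIP towers and retinal images, and the class-prompt embedding table is an assumed simplification of CLIP's text encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in encoders (the real pipeline used CLIP's image and text towers).
img_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
txt_enc = nn.Embedding(5, 64)  # one "prompt" embedding per severity grade (assumption)

images = torch.randn(16, 3, 32, 32)   # dummy stand-ins for retinal images
labels = torch.randint(0, 5, (16,))   # dummy severity grades

# --- Stage 1: multimodal contrastive alignment (CLIP-style pretraining) ---
opt = torch.optim.Adam(
    list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-3
)
for _ in range(50):
    z_img = F.normalize(img_enc(images), dim=-1)
    z_txt = F.normalize(txt_enc(labels), dim=-1)
    logits = z_img @ z_txt.t() / 0.07  # temperature-scaled cosine similarities
    targets = torch.arange(len(images))
    # Symmetric InfoNCE over image->text and text->image directions;
    # duplicate-label batches make this noisy, which is fine for a sketch.
    loss = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: freeze the image encoder, train an image-only classifier ---
for p in img_enc.parameters():
    p.requires_grad_(False)
clf = nn.Linear(64, 5)
opt2 = torch.optim.Adam(clf.parameters(), lr=1e-2)
for _ in range(100):
    loss = F.cross_entropy(clf(img_enc(images)), labels)
    opt2.zero_grad(); loss.backward(); opt2.step()

acc = (clf(img_enc(images)).argmax(1) == labels).float().mean().item()
print(f"train accuracy on dummy data: {acc:.2f}")
```

The key design point is that stage 2 trains only the classification head, so the classifier inherits whatever semantic structure stage 1 baked into the image embeddings.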
Technical Approach
- Compared representation quality and downstream performance between CLIP and ResNet approaches.
- Evaluated visual localization behavior using Grad-CAM maps.
- Examined embedding separability and class structure using t-SNE projections.
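The Grad-CAM evaluation can be sketched as follows. This is a hedged illustration on a tiny stand-in CNN rather than ResNet-50; the layer choice, hook names, and dummy input are all assumptions, but the weighting scheme (channel weights from spatially averaged gradients, ReLU over the weighted sum of feature maps) is standard Grad-CAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Tiny stand-in CNN (the project used ResNet-50; this architecture is assumed).
conv = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                     nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
head = nn.Linear(16, 5)

# Hooks capture the last conv layer's activations and their gradients.
feats, grads = {}, {}
conv[2].register_forward_hook(lambda m, i, o: feats.update(a=o))
conv[2].register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 32, 32)  # dummy stand-in for a fundus image
logits = head(F.adaptive_avg_pool2d(conv(x), 1).flatten(1))
cls = logits.argmax(1).item()
logits[0, cls].backward()      # backprop the predicted class score

# Grad-CAM: channel weights = spatial mean of gradients,
# heatmap = ReLU of the weighted sum over channels.
w = grads["a"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((w * feats["a"]).sum(dim=1))
cam = cam / (cam.max() + 1e-8)  # normalize to [0, 1]
print(cam.shape)                 # heatmap at feature-map resolution
```

In the actual analysis, such a heatmap would be upsampled to the input resolution and overlaid on the retinal image to check whether the model attends to lesion regions.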
Key Insight
CLIP pretraining improved semantic structure in the embedding space, but the image-only ResNet-50 performed better (92.52% vs. 89.31%) because diabetic retinopathy grading hinges on fine-grained, localized visual detail.
Tools / Models Used
Python, PyTorch, CLIP, ResNet-50, Grad-CAM, t-SNE.