HEDWIG: Learning Geospatial Embeddings for Large-Scale Retrieval
One-line summary: Built a ViCLIP-based geolocation system that learns richer geospatial embeddings from multi-frame panoramic imagery and captions.
Key Results
- Reduced median top-1 prediction error by over 1600 km compared with CLIP.
- Increased the proportion of predictions within 750 km by nearly 4x.
- Improved top-1 retrieval quality across all evaluated distance thresholds.
What I Built
- ViCLIP-based multimodal embedding pipeline.
- Projection and classification layers for geolocation clustering.
- Geocell-based retrieval workflow using similarity search.
- Experiment pipeline for captioning, clustering, and ablation studies.
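The projection and classification layers above can be sketched as a small PyTorch head on top of frozen ViCLIP embeddings. All dimensions and names here are illustrative assumptions, not the actual configuration:

```python
import torch
import torch.nn as nn

class GeocellHead(nn.Module):
    """Sketch: project a ViCLIP embedding into a geospatial space,
    then classify it into a geocell. Dimensions are illustrative."""
    def __init__(self, embed_dim=768, proj_dim=256, num_geocells=2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(embed_dim, proj_dim),
            nn.ReLU(),
            nn.LayerNorm(proj_dim),
        )
        self.classifier = nn.Linear(proj_dim, num_geocells)

    def forward(self, x):
        z = self.proj(x)              # geospatial embedding, used for retrieval
        logits = self.classifier(z)   # geocell logits, used for classification
        return z, logits

head = GeocellHead()
emb = torch.randn(4, 768)             # stand-in for a batch of ViCLIP embeddings
z, logits = head(emb)
print(z.shape, logits.shape)          # torch.Size([4, 256]) torch.Size([4, 2048])
```

Returning both the projected embedding and the logits lets one head serve the retrieval workflow and the geocell classifier at once.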
Technical Approach
- Built a preprocessing pipeline that converts panoramic viewpoints into multi-frame representations.
- Trained embedding and clustering heads for geocell prediction.
- Benchmarked retrieval and geolocation quality against CLIP baselines.
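The geocell-based retrieval step above amounts to a nearest-neighbor search over normalized embeddings. A minimal NumPy sketch, with toy 4-d embeddings and hypothetical geocell ids standing in for the real index:

```python
import numpy as np

def retrieve_geocells(query, index_embs, geocell_ids, k=3):
    """Return the k geocell ids whose embeddings are most
    cosine-similar to the query. Names are illustrative."""
    q = query / np.linalg.norm(query)
    db = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    sims = db @ q                # cosine similarity to every indexed geocell
    top = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return [geocell_ids[i] for i in top]

# Toy index: three 4-d embeddings, one clearly aligned with the query.
ids = ["cell_paris", "cell_tokyo", "cell_lima"]
index = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
query = np.array([0.9, 0.1, 0.0, 0.0])
print(retrieve_geocells(query, index, ids, k=2))  # ['cell_paris', 'cell_tokyo']
```

At scale, the brute-force `argsort` would be replaced by an approximate nearest-neighbor index, but the similarity computation is the same.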
Key Insight
Averaging embeddings across viewpoints dilutes the location signal; weighted multi-frame representations improve geospatial retrieval and clustering.
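The contrast can be sketched in a few lines: uniform mean pooling treats every viewpoint equally, while softmax-weighted pooling lets informative frames dominate. The relevance scores here are hypothetical stand-ins for what a learned weighting would produce:

```python
import numpy as np

def mean_pool(frames):
    """Uniform average across per-viewpoint embeddings."""
    return frames.mean(axis=0)

def weighted_pool(frames, scores):
    """Softmax-weighted average: frames with higher relevance scores
    (e.g. from a learned attention head) dominate the pooled embedding."""
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return (w[:, None] * frames).sum(axis=0)

# Four viewpoints with 3-d embeddings: two informative, two near-noise.
frames = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])
scores = np.array([2.0, 2.0, -2.0, -2.0])  # hypothetical learned relevance

print(mean_pool(frames))             # pulled toward the noise frames
print(weighted_pool(frames, scores)) # dominated by the informative frames
```

With uniform averaging the noise frames contribute as much as the informative ones; weighting concentrates the pooled vector on the frames that actually carry location signal.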
Tools / Models Used
Python, PyTorch, ViCLIP, CLIP, geospatial clustering, retrieval, similarity search.