Active
Vision-Language Models
Spatial Understanding for Vision-Language Models

I am currently working on spatial understanding for VLMs: improving spatial representations in both standard and efficient VLMs, building on our prior work on context-aware object recognition and closed-form adaptation (Koo-Fu CLIP).

Recognition & Benchmarks
Aiming for Perfect ImageNet-1k

The goal is a complete reannotation of the ImageNet-1k validation set; the project is ongoing, with a publication in preparation. It builds on our analysis of ImageNet's labeling flaws and on VLM-based recognition methods.

Video Understanding
RL for Video Object Segmentation

Investigating reinforcement learning for learned memory control in the Segment Anything Model 2 (SAM2), with the goal of improving long-form video object segmentation by dynamically managing the memory bank.
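Learned memory control of this kind can be framed as an eviction policy over a fixed-capacity bank of past frame features. A minimal sketch, assuming a simplified interface (the `MemoryBank` class and the redundancy heuristic below are illustrative stand-ins, not SAM2's actual memory attention or a trained RL policy):

```python
import numpy as np

class MemoryBank:
    """Fixed-capacity memory of past frame features.

    When full, an eviction policy picks which slot to drop; a learned
    (e.g. RL-trained) policy would replace the heuristic below.
    """
    def __init__(self, capacity=7):
        self.capacity = capacity
        self.frames = []  # list of (frame_idx, feature_vector) pairs

    def add(self, frame_idx, feature, policy):
        if len(self.frames) >= self.capacity:
            self.frames.pop(policy(self.frames, feature))
        self.frames.append((frame_idx, feature))

def redundancy_policy(frames, new_feature):
    """Heuristic stand-in for a learned policy: evict the stored frame
    most similar (cosine) to the incoming one, keeping memories diverse."""
    sims = [
        float(f @ new_feature) / (np.linalg.norm(f) * np.linalg.norm(new_feature))
        for _, f in frames
    ]
    return int(np.argmax(sims))
```

An RL formulation would replace `redundancy_policy` with a policy network whose reward is downstream segmentation quality over the long video, so that what to remember is learned rather than hand-designed.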

Past
2023 – 2025
Fine-grained Species Classification & Biodiversity Benchmarks

Creating large-scale multi-modal datasets and challenges for fine-grained visual categorization of fungi and other species. Co-organized the FungiCLEF competition, including its FGVC @ CVPR 2025 workshop edition, and contributed to the broader LifeCLEF benchmarking ecosystem.

2022 – 2024
Test-Time Adaptation for Segmentation

Methods for adapting segmentation models at test time using only a single image, without access to training data or labels. Work done in collaboration with Chaim Baskin and Alex Bronstein (Technion).
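One common recipe for this setting is entropy minimization on the single test image, updating only normalization-layer parameters. A minimal sketch under that assumption (a generic illustration of the setting, not the exact method from this work):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def adapt_on_single_image(model, image, steps=10, lr=1e-3):
    """Adapt a segmentation model to one test image by minimizing the
    entropy of its per-pixel class predictions. Only normalization-layer
    parameters are updated, so no training data or labels are needed."""
    params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm2d, nn.GroupNorm, nn.LayerNorm)):
            for p in m.parameters():
                p.requires_grad_(True)
                params.append(p)
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        logits = model(image)                      # (1, C, H, W)
        probs = F.softmax(logits, dim=1)
        # mean per-pixel entropy; clamp log to avoid -inf at prob ~ 0
        entropy = -(probs * probs.log().clamp(min=-20)).sum(dim=1).mean()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    return model
```

Restricting updates to normalization parameters keeps the adaptation lightweight and reduces the risk of the model collapsing to a trivial low-entropy solution on a single image.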

2021 – 2022
AI-Assisted Labeling for Civil Infrastructure Inspection

Leveraging model explainability and foundation models to bootstrap labeling for visual inspection of bridges and buildings. Research internship at IBM Research Zurich.

2019 – 2021
Scene Text Detection & Recognition

Exploring real-world and synthetic data sources for training robust scene text detection and recognition models. Research visit to the Computer Vision Centre (CVC) at UAB, Barcelona.