NSF Biometrics Performance Disparity Codebase

The dataset curation pipeline begins by collecting large-scale unlabeled face images from sources such as LAION and BUPT, with FairFace serving as a balanced reference dataset. All images are embedded with a pretrained CLIP encoder, and duplicates in the unlabeled set are removed based on the cosine similarity of their embeddings. A similarity search then retrieves diverse, high-quality unlabeled samples that align with FairFace examples; an optional face image quality assessment step filters out low-quality images. The final curated dataset (~200K images) combines FairFace with the selected unlabeled samples, ensuring diversity, balance, and quality for fair and effective facial attribute classification.
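The deduplication and retrieval steps described above can be sketched as follows. This is a minimal illustration, not the codebase's actual implementation: it assumes the CLIP embeddings have already been computed and are available as NumPy arrays, and the function names (`deduplicate`, `retrieve`), the similarity threshold of 0.95, and the neighbor count `k` are all hypothetical choices made for this example. The optional quality-assessment filter is not shown.

```python
from typing import List

import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def deduplicate(embeddings: np.ndarray, threshold: float = 0.95) -> List[int]:
    """Greedy dedup: keep a row only if its cosine similarity to every
    previously kept row is below `threshold` (an assumed value)."""
    unit = l2_normalize(embeddings)
    kept: List[int] = []
    for i in range(len(unit)):
        if not kept or np.max(unit[kept] @ unit[i]) < threshold:
            kept.append(i)
    return kept


def retrieve(reference: np.ndarray, pool: np.ndarray, k: int = 1) -> List[int]:
    """Return the sorted pool indices that appear among the top-k most
    similar pool rows for at least one reference row."""
    sims = l2_normalize(reference) @ l2_normalize(pool).T  # shape (n_ref, n_pool)
    k = min(k, pool.shape[0])
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return sorted(set(top_k.ravel().tolist()))


# Toy 2-D vectors standing in for CLIP embeddings of unlabeled images.
pool = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [-1.0, 0.0]])
kept = deduplicate(pool)  # row 1 is dropped as a duplicate of row 0

# One toy vector standing in for a FairFace reference embedding.
reference = np.array([[0.9, 0.1]])
selected = retrieve(reference, pool[kept], k=1)
```

The greedy loop compares each candidate against all kept rows, which is quadratic in the worst case; at the scale of LAION-sized collections an approximate nearest-neighbor index would likely be needed instead, though the source does not say which approach the pipeline uses.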
