Unsupervised Keypoint Detection and Description
Repeatability, Mean Localization Error, and Matching Score comparison on the SceneCity dataset. The upper row shows results from our reproduction of UnsuperPoint; the lower row shows results from our proposed approach.
An example pair from our SceneCity dataset (0.2 m translation along the x-axis and 10° rotation about the y-axis). We used the SceneCity add-on for Blender to create this synthetic dataset for evaluation.
Matching results on the HPatches homography estimation dataset.
(a), (b): illumination-change scenes; (c), (d): viewpoint-change scenes. (a), (c): estimated keypoint correspondences (red: unmatched, green: matched); (b), (d): correct (green) and incorrect (red) correspondences.
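The correctness check visualized above can be sketched as follows: descriptors are matched by nearest neighbor, and a match counts as correct when the keypoint, reprojected through the ground-truth homography, lands within a pixel threshold of its partner. This is a minimal illustrative sketch, not our exact evaluation code; the function name, the mutual-nearest-neighbor rule, and the 3-pixel threshold are assumptions.

```python
import numpy as np

def evaluate_matches(kp1, kp2, desc1, desc2, H_gt, px_thresh=3.0):
    """Mutual nearest-neighbor matching plus a correctness check.

    kp1, kp2:   (N, 2) / (M, 2) keypoint coordinates (x, y)
    desc1/2:    (N, D) / (M, D) L2-normalized descriptors
    H_gt:       3x3 ground-truth homography mapping image 1 -> image 2
    px_thresh:  reprojection tolerance in pixels (value is illustrative)
    """
    sim = desc1 @ desc2.T                       # cosine similarity matrix
    nn12 = sim.argmax(axis=1)                   # best match 1 -> 2
    nn21 = sim.argmax(axis=0)                   # best match 2 -> 1
    mutual = nn21[nn12] == np.arange(len(kp1))  # keep mutual matches only

    # Reproject kp1 through the ground-truth homography.
    ones = np.ones((len(kp1), 1))
    proj = np.hstack([kp1, ones]) @ H_gt.T
    proj = proj[:, :2] / proj[:, 2:3]

    err = np.linalg.norm(proj - kp2[nn12], axis=1)
    correct = mutual & (err < px_thresh)
    return nn12, mutual, correct   # matched vs. correct, as in the figure
```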
Abstract
This research presents a novel unsupervised framework and end-to-end training procedure for keypoint detectors and descriptors. The framework and training procedure are designed to extract keypoints and descriptors that remain robust under real-world geometric challenges such as abrupt depth changes or occlusion around keypoints. Compared to previous approaches, we feed the network more realistic training images, each synthesized from two images of the training dataset and corresponding homographies. When trained on the MS-COCO dataset and evaluated on the HPatches and SceneCity datasets, our framework achieves better descriptor matching performance in practice than our reproduction of UnsuperPoint (our baseline).
Contribution: I conducted this research alone, under the supervision of Hyeok-Jun Kwon and Professor Kuk-Jin Yoon.
Methods
Our training data provider mimics real-world geometry such as occlusion. Images I1 and I2 come from the MS-COCO dataset; M is a mask of the same size as both images. The quadrangles forming the mask are drawn at random. A sketch of this pipeline is given below.
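The following is a minimal sketch of such a data provider, assuming that the composite image I+ blends homography-warped views of I1 and I2 through the quadrangle mask M, so that mask borders act like occlusion boundaries. The blending order, function names, and all parameter values are assumptions made for illustration, not our exact implementation.

```python
import cv2
import numpy as np

def random_quad_mask(h, w, n_quads=3, rng=None):
    """Binary mask built from a few randomly drawn quadrangles."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((h, w), dtype=np.uint8)
    for _ in range(n_quads):
        quad = rng.integers(0, [w, h], size=(4, 2)).astype(np.int32)
        cv2.fillPoly(mask, [quad], 1)
    return mask

def random_homography(h, w, max_shift=0.15, rng=None):
    """Homography obtained by randomly perturbing the image corners."""
    rng = rng or np.random.default_rng()
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * [w, h]
    dst = (src + jitter).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def make_training_triplet(img1, img2):
    """Build (I+, homographies, M): I+ composites warped views of I1
    and I2 through the quadrangle mask M (names follow the caption)."""
    h, w = img1.shape[:2]
    M = random_quad_mask(h, w)[..., None]        # occluder mask
    H1 = random_homography(h, w)
    H2 = random_homography(h, w)
    warp1 = cv2.warpPerspective(img1, H1, (w, h))
    warp2 = cv2.warpPerspective(img2, H2, (w, h))
    i_plus = warp1 * M + warp2 * (1 - M)         # composite view I+
    return i_plus, (H1, H2), M
```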
Our network architecture (the figure is adapted from the UnsuperPoint paper). Two descriptor vectors are extracted for each keypoint; we expect the two descriptor branches to learn different parts of the region around a keypoint, especially for keypoints near edges (e.g., one descriptor focuses on the nearer part, the other on the farther part). The input images I1, I2, and I+ are fed to our model in a siamese manner, and the resulting keypoints and descriptors are compared so that the network learns an edge-aware keypoint detector and descriptor. A structural sketch follows.
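The sketch below captures the one architectural idea from the caption, two parallel descriptor heads on a shared backbone, in PyTorch. The score and position heads mirror UnsuperPoint's per-cell predictions, but the backbone depth, channel widths, and descriptor dimension are placeholders, not our actual configuration.

```python
import torch
import torch.nn as nn

class DualDescriptorNet(nn.Module):
    """UnsuperPoint-style network with two descriptor branches.

    Only the two parallel descriptor heads reflect the idea in the
    caption; all layer sizes here are illustrative assumptions.
    """
    def __init__(self, desc_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(            # shared features, stride 8
            self._block(3, 32), self._block(32, 64), self._block(64, 128),
        )
        self.score_head = nn.Sequential(          # keypoint score per cell
            nn.Conv2d(128, 1, 3, padding=1), nn.Sigmoid())
        self.position_head = nn.Sequential(       # relative (x, y) per cell
            nn.Conv2d(128, 2, 3, padding=1), nn.Sigmoid())
        # Two parallel descriptor branches: intended to specialize on
        # different sides of a depth edge around the same keypoint.
        self.desc_head_a = nn.Conv2d(128, desc_dim, 3, padding=1)
        self.desc_head_b = nn.Conv2d(128, desc_dim, 3, padding=1)

    @staticmethod
    def _block(c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True), nn.MaxPool2d(2))

    def forward(self, x):
        f = self.backbone(x)
        score = self.score_head(f)
        pos = self.position_head(f)
        desc_a = nn.functional.normalize(self.desc_head_a(f), dim=1)
        desc_b = nn.functional.normalize(self.desc_head_b(f), dim=1)
        return score, pos, (desc_a, desc_b)
```

In a siamese setup, the same module would be applied to I1, I2, and I+, and the two descriptor sets compared across views under the known homographies.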