Paper

Visual Localization Using Sparse Semantic 3D Map


Motivation

  • most traditional methods fail to localize the camera under a wide range of viewing-condition variations, including seasonal and illumination changes as well as weather and day-night variations

Proposed scheme

  • combine image-based and structure-based localization with semantic information
  • separate into three parts:
    • sparse semantic 3D map (see the voting sketch after this list)
      • apply an off-the-shelf segmentation CNN (the DeepLabv3+ network) to all database images
      • reproject the pixel labels of all database images onto the 3D point cloud
      • apply maximum voting to assign a label to each 3D point in the point cloud
      • remove dynamic objects from the point cloud to obtain Ms
      • Ms is the resulting cleaner sparse semantic 3D map
    • semantic score (see the scoring sketch after this list)
      • obtain the top-k ranked database images IR for each query image IQ via NetVLAD
      • for every selected IiR, find 2D-2D matches between IQ and IiR through KNN search and a ratio test (blue dotted lines)
      • obtain 2D-3D correspondences between IiR and Ms (green solid lines)
      • obtain 3D-2D matches between Ms and IQ (red solid lines)
      • apply a PnP solver to recover the query pose
      • project all visible 3D points into IQ (visible meaning the 3D points can be seen from IQ's pose)
      • count the number of 3D points whose semantic labels are the same as those in IQ
    • weighted RANSAC pose estimation (see the weighted-sampling sketch after this list)
      • 2D-3D matches produced by the same IiR are assigned the semantic score of IiR
      • normalize each score by the sum of all 2D-3D match scores
      • use the normalized score as the sampling weight p in RANSAC
      • this differs from simply removing 2D-3D matches with lower semantic scores; low-scoring matches are only down-weighted
    • appendix
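
A minimal sketch of the map-labeling step above, assuming hypothetical inputs: per-image label maps from an off-the-shelf DeepLabv3+ run and a point-to-pixel visibility list from the SfM model; the variable names (segmentations, observations, dynamic_labels) are illustrative, not from the paper.

```python
import numpy as np
from collections import Counter

def build_sparse_semantic_map(points_3d, segmentations, observations, dynamic_labels):
    """Label each 3D point by maximum voting over the pixels that observe it,
    then drop points labeled as dynamic objects to obtain the cleaner map Ms.

    points_3d      : list of (x, y, z) coordinates from the SfM point cloud
    segmentations  : segmentations[i] is an HxW label map for database image i
    observations   : observations[p] is a list of (image_idx, u, v) pixels seeing point p
    dynamic_labels : set of label ids treated as dynamic objects (e.g. person, car)
    """
    kept_points, kept_labels = [], []
    for p, xyz in enumerate(points_3d):
        votes = Counter()
        for img_idx, u, v in observations[p]:
            votes[segmentations[img_idx][v, u]] += 1
        if not votes:
            continue
        label, _ = votes.most_common(1)[0]     # maximum voting over observing pixels
        if label in dynamic_labels:
            continue                           # remove dynamic objects from the map
        kept_points.append(xyz)
        kept_labels.append(label)
    return np.asarray(kept_points), np.asarray(kept_labels)
```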
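
A sketch of the semantic-score computation for one retrieved image, assuming OpenCV's PnP solver, an undistorted pinhole camera, and a precomputed set of labeled 3D points from Ms expected to be visible in the query; the function and argument names are hypothetical.

```python
import numpy as np
import cv2

def semantic_score(pts3d, pts2d, visible_pts3d, visible_labels, query_seg, K):
    """Recover a tentative query pose from the 2D-3D matches of one retrieved
    image, project the visible labeled points of Ms into the query image, and
    count how many projected labels agree with the query's segmentation."""
    dist = np.zeros(4)                               # assume no lens distortion
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        pts3d.astype(np.float32), pts2d.astype(np.float32), K, dist)
    if not ok:
        return 0
    proj, _ = cv2.projectPoints(visible_pts3d.astype(np.float32), rvec, tvec, K, dist)
    proj = proj.reshape(-1, 2)
    h, w = query_seg.shape
    score = 0
    for (u, v), label in zip(proj, visible_labels):
        ui, vi = int(round(u)), int(round(v))
        if 0 <= ui < w and 0 <= vi < h and query_seg[vi, ui] == label:
            score += 1                               # projected label matches the query
    return score
```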
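
A sketch of the weighted sampling inside RANSAC, assuming each 2D-3D match already carries the semantic score of the retrieved image that produced it; the names below are illustrative. Each RANSAC iteration would fit a pose on the sampled matches and count inliers as usual.

```python
import numpy as np

def weighted_minimal_sample(match_scores, sample_size=4, rng=None):
    """Draw a minimal PnP sample with probability proportional to the normalized
    semantic score of each 2D-3D match; low-scoring matches are down-weighted
    rather than removed."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(match_scores, dtype=np.float64)
    p = scores / scores.sum()            # normalize by the sum of all match scores
    return rng.choice(len(scores), size=sample_size, replace=False, p=p)
```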

Experiment

  • visual localization dataset: RobotCar Seasons (benchmark)
  • use the DeepLabv3+ network to segment all dataset images and assign a label to each 3D point by maximum voting over the reprojected pixel labels from all of its visible database images
  • in the image retrieval step, use NetVLAD with the pre-trained Pitts30K model to generate a 4096-dimensional descriptor vector for each query and database image, and rank database images by the L2 distance between the L2-normalized descriptors (see the retrieval sketch after this list)
  • retrieval number: k=30 for day conditions; k=50 for night conditions
  • precision thresholds: high (0.25m, 2deg); medium (0.5m, 5deg); coarse (5m, 10deg) (see the precision-bucket sketch after this list)
  • comparison with related works: DenseVLAD, NetVLAD, FAB-MAP, Active Search, CSL, and a non-semantic baseline
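
A small sketch of the retrieval step under the setup above (4096-dimensional global descriptors, L2-normalized, ranked by L2 distance); the function name is illustrative.

```python
import numpy as np

def retrieve_top_k(query_desc, db_descs, k):
    """Rank database images for one query by L2 distance between
    L2-normalized 4096-D global descriptors (e.g. NetVLAD outputs)."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    dists = np.linalg.norm(db - q, axis=1)
    return np.argsort(dists)[:k]       # k=30 for day queries, k=50 for night queries
```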
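
For reference, the three precision regimes can be checked as below, given a translation error in meters and a rotation error in degrees (a hypothetical helper, not from the benchmark code).

```python
def precision_bucket(t_err_m, rot_err_deg):
    """Classify a pose error against the RobotCar Seasons precision thresholds."""
    if t_err_m <= 0.25 and rot_err_deg <= 2.0:
        return "high"      # (0.25 m, 2 deg)
    if t_err_m <= 0.5 and rot_err_deg <= 5.0:
        return "medium"    # (0.5 m, 5 deg)
    if t_err_m <= 5.0 and rot_err_deg <= 10.0:
        return "coarse"    # (5 m, 10 deg)
    return "not localized"
```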