
Visual Localization Using Sparse Semantic 3D Map


  • most traditional methods fail to localize the camera under wide variations in viewing conditions, including seasonal and illumination changes as well as weather and day-night variations

Proposed scheme

  • combine image-based and structure-based localization with semantic information
  • separate into three parts:
    • sparse semantic 3D map
      • apply an off-the-shelf segmentation CNN (DeepLabv3+ network) to all database images
      • reproject the pixel labels of all database images onto the 3D point cloud
      • apply maximum voting to assign a label to each 3D point in the point cloud
      • remove dynamic objects from the 3D point cloud to obtain Ms
      • Ms represents a cleaner sparse semantic 3D map
    • semantic score
      • obtain top-k ranked database images IR for each query image IQ by NetVLAD
      • for every selected IiR, find 2D-2D matches between IQ and IiR through KNN search and ratio test (blue dotted lines)
      • obtain 2D-3D correspondences between IiR and Ms (green solid lines)
      • obtain 3D-2D matches between Ms and IQ (red solid lines)
      • apply a PnP solver to recover the query pose
      • project all visible 3D points into IQ (visible means the 3D point should be seen by IQ)
      • count the number of 3D points whose semantic labels match the labels at their projections in IQ; this count is the semantic score of IiR
    • weighted RANSAC pose estimation
      • 2D-3D matches produced by the same IiR are assigned the semantic score of IiR
      • normalize each score by the sum of all 2D-3D match scores
      • use the normalized score as a weight p for RANSAC's sampling
      • this differs from simply removing 2D-3D matches with low semantic scores
    • appendix
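
The three parts above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (vote_label, semantic_score, match_weights, sample_minimal_set) and the data layouts are assumptions made for illustration, and projection, visibility checks, and the PnP solver are assumed to have run already. The uniform fallback for all-zero scores is also an assumption, not something the notes specify.

```python
import random
from collections import Counter


def vote_label(observed_labels):
    """Maximum voting: return the most frequent label among a point's observations."""
    return Counter(observed_labels).most_common(1)[0][0]


def semantic_score(projected_points, query_labels):
    """Count projected 3D points whose label matches the query segmentation.

    projected_points: (u, v, label) tuples, assumed visible and inside the image.
    query_labels: per-pixel labels of the query image, indexed as [v][u].
    """
    return sum(1 for u, v, label in projected_points if query_labels[v][u] == label)


def match_weights(match_image_ids, score_per_image):
    """Give each match the score of the image that produced it, normalized to sum to 1."""
    raw = [score_per_image[image_id] for image_id in match_image_ids]
    if not raw:
        return []
    total = sum(raw)
    if total == 0:
        # Assumption: fall back to uniform sampling when no score is positive.
        return [1.0 / len(raw)] * len(raw)
    return [score / total for score in raw]


def sample_minimal_set(weights, size, rng):
    """Draw `size` distinct match indices with probability proportional to weight."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(size):
        current = [weights[i] for i in remaining]
        if sum(current) > 0:
            pick = rng.choices(remaining, weights=current, k=1)[0]
        else:
            pick = rng.choice(remaining)  # only zero-weight matches are left
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Toy data: one 3D point's observations, a 2x2 query segmentation, three matches.
point_label = vote_label(["building", "road", "building"])
query = [["sky", "sky"], ["road", "building"]]
score = semantic_score([(0, 0, "sky"), (1, 1, "building"), (0, 1, "building")], query)
weights = match_weights(["a", "a", "b"], {"a": 3, "b": 1})
sample = sample_minimal_set(weights, 3, random.Random(0))
```

In this sketch, matches from low-scoring retrieved images stay in the candidate pool and are simply drawn less often, which is the distinction the last bullet of the weighted RANSAC part makes.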


  • visual localization dataset: RobotCar Seasons (benchmark)
  • use the DeepLabv3+ network to segment all dataset images, and assign a label to each 3D point by maximum voting over the pixel labels reprojected from all database images in which the point is visible
  • in the image retrieval step, use NetVLAD with the pre-trained Pitts30K model to generate a 4096-dimensional descriptor vector for each query and database image, and normalize the L2 distances between the descriptors
  • number of retrieved images: k=30 for day conditions; k=50 for night conditions
  • precision thresholds: high (0.25m, 2deg); medium (0.5m, 5deg); coarse (5m, 10deg)
  • comparison with related works: DenseVLAD, NetVLAD, FAB-MAP, Active Search, CSL, Non-semantic
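
The precision thresholds above can be applied to per-query pose errors roughly as follows. This is a sketch under assumptions: THRESHOLDS, levels_met, and recall_per_level are hypothetical names, the position and rotation errors of each query are assumed to be computed already, and treating the bounds as inclusive is a guess rather than something stated in the notes.

```python
# (max position error in meters, max rotation error in degrees) per precision level
THRESHOLDS = {
    "high": (0.25, 2.0),
    "medium": (0.5, 5.0),
    "coarse": (5.0, 10.0),
}


def levels_met(position_error_m, rotation_error_deg):
    """Return the precision levels whose position AND rotation bounds are both met."""
    return [
        name
        for name, (max_m, max_deg) in THRESHOLDS.items()
        if position_error_m <= max_m and rotation_error_deg <= max_deg
    ]


def recall_per_level(errors):
    """Fraction of queries localized within each level.

    errors: list of (position_error_m, rotation_error_deg), one pair per query.
    """
    if not errors:
        return {name: 0.0 for name in THRESHOLDS}
    counts = {name: 0 for name in THRESHOLDS}
    for position_error_m, rotation_error_deg in errors:
        for name in levels_met(position_error_m, rotation_error_deg):
            counts[name] += 1
    return {name: counts[name] / len(errors) for name in THRESHOLDS}


# Four made-up queries, used only to exercise the functions.
example_errors = [(0.1, 1.0), (0.3, 1.0), (6.0, 1.0), (1.0, 8.0)]
example_recall = recall_per_level(example_errors)
```

A query counts toward a level only if it satisfies both bounds, so a pose with a small position error but a large rotation error can still miss every level.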