🐛 [FIX] Fix Float16 overflow problem in box_nms #234

Open
TrungDinhT wants to merge 1 commit into MultimediaTechLab:main from TrungDinhT:bugfix/fix_non_max_suppression

Problem

When running inference with precision="16-mixed" (or any configuration that produces float16 model outputs), bbox_nms sometimes silently fails to suppress overlapping bounding boxes when the batch contains many images. This results in many duplicate detections surviving NMS for the same object.

Root Cause

batched_nms (from torchvision.ops) internally separates NMS groups by spatially shifting boxes:

shifted_boxes = boxes + (label * (max_coord + 1))

This shift is designed to guarantee that boxes from different (image, class) groups never overlap. However, when box coordinates are float16, the shifted coordinates quickly leave the range in which float16 can resolve individual pixels (it represents integers exactly only up to 2048). For example:

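  • With max_coord ≈ 3500 and label = 18 (image 2, class 2 in a batch of 8): the offset is 18 × 3501 = 63018, so the largest shifted coordinate is 63018 + 3500 = 66518. That is above float16's largest finite value (65504), and even below that limit float16 values in this range are spaced 32 apart.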
  • The y-coordinates of different boxes within the same cluster collapse to the same float16 bucket, making their computed IoU incorrect.
  • Some box pairs that should have IoU > 0.5 compute as IoU ≈ 0, so NMS skips suppression and all duplicates survive.
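
The collapse described above can be reproduced without a GPU. The following is a minimal sketch, not the torchvision implementation: it emulates float16 and float32 rounding with Python's `struct` module, and the helper names (`round_to`, `shift_box`, `iou`) and the two sample boxes are made up for illustration. It assumes the offset is computed as `label * (max_coord + 1)` as in the line quoted from the root cause.

```python
import math
import struct


def round_to(x: float, fmt: str) -> float:
    """Round x to IEEE 754 half ("e") or single ("f") precision; inf on overflow."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def shift_box(box, label, max_coord, fmt):
    """Add the per-group offset, rounding the offset and each result to `fmt`."""
    offset = round_to(label * (max_coord + 1), fmt)
    return tuple(round_to(c + offset, fmt) for c in box)


def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes; a zero-area union counts as 0 here."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0.0) * max(h, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# Two hypothetical 10x10 boxes that overlap heavily (true IoU = 81/119).
box_a = (100.0, 100.0, 110.0, 110.0)
box_b = (101.0, 101.0, 111.0, 111.0)
label, max_coord = 18, 3500  # values from the example above

iou_true = iou(box_a, box_b)
iou_fp16 = iou(shift_box(box_a, label, max_coord, "e"),
               shift_box(box_b, label, max_coord, "e"))
iou_fp32 = iou(shift_box(box_a, label, max_coord, "f"),
               shift_box(box_b, label, max_coord, "f"))
print(f"true={iou_true:.3f} float16={iou_fp16:.3f} float32={iou_fp32:.3f}")

# A box near max_coord overflows float16 entirely once shifted.
far_box = shift_box((3400.0, 3400.0, 3500.0, 3500.0), label, max_coord, "e")
print(far_box)
```

In this sketch all eight shifted float16 coordinates land on the same representable value, so both boxes become zero-area and the pair is no longer recognised as overlapping, while the float32 path keeps the original IoU. A real NMS kernel may produce NaN rather than 0 for such degenerate boxes; in either case the comparison against the IoU threshold fails and nothing is suppressed.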

The bug is invisible with a single image because valid_cls labels stay small (0, 1, 2), producing tiny offsets that float16 handles correctly. It surfaces once batch_idx + valid_cls * B (with B the batch size) produces large enough labels.
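
To make the dependence on batch size concrete, here is a small illustrative calculation. It is a sketch under stated assumptions: `group_label` mirrors the `batch_idx + valid_cls * B` expression from this PR, `float16_spacing` is a hypothetical helper that returns the gap between neighbouring float16 values in the normal range, and `max_coord = 3500` is taken from the example in the root-cause section.

```python
import math

FLOAT16_MAX = 65504.0  # largest finite float16 value


def group_label(batch_idx: int, cls_idx: int, batch_size: int) -> int:
    """Group id as built in bbox_nms: batch_idx + valid_cls * B."""
    return batch_idx + cls_idx * batch_size


def float16_spacing(x: float) -> float:
    """Gap between adjacent float16 values near x (normal range only)."""
    return 2.0 ** (math.floor(math.log2(x)) - 10)


max_coord = 3500  # assumed, from the example in the root-cause section

# (batch_size, batch_idx, cls_idx): a single image vs. image 2 in a batch of 8.
for batch_size, batch_idx, cls_idx in [(1, 0, 2), (8, 2, 2)]:
    label = group_label(batch_idx, cls_idx, batch_size)
    largest = label * (max_coord + 1) + max_coord
    if largest > FLOAT16_MAX:
        status = "exceeds the float16 maximum"
    else:
        status = f"float16 spacing here is {float16_spacing(largest):g}"
    print(f"B={batch_size} label={label} largest shifted coord={largest}: {status}")
```

The same class index yields label 2 for a single image but label 18 in a batch of 8, which is what moves the shifted coordinates from a range float16 can still resolve reasonably well into one where it cannot.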

Fix

Cast valid_box and valid_con to float32 before passing to batched_nms:

def bbox_nms(cls_dist: Tensor, bbox: Tensor, nms_cfg: NMSConfig, confidence: Optional[Tensor] = None):
    ...
    nms_idx = batched_nms(valid_box.float(), valid_con.float(), batch_idx + valid_cls * bbox.size(0), nms_cfg.min_iou)

This is effectively free: at the NMS stage we are working with a small set of filtered detections, not the full feature map.

Test

Added test_bbox_nms_float16_precision(), which constructs an extreme scenario in which bbox_nms failed to suppress overlapping boxes because of float16 overflow. The test fails before this fix and passes after it.
