D-FINE is a real-time object detection model introduced in the paper "D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement".
You can retrieve bounding boxes whose edges align with a rotated object by training an oriented bounding box (OBB) object detection model, such as YOLOv8's Oriented Bounding Boxes model.
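To illustrate what an oriented box represents, the sketch below assumes a common parameterization: a center point, a width, a height, and a rotation angle in radians. The function name `obb_corners` is hypothetical and is not part of any YOLOv8 API; it only shows how such a box maps to four corner points.

```python
import math


def obb_corners(cx, cy, w, h, angle_rad):
    """Return the four corners of a rotated box as (x, y) tuples.

    Assumes the box is described by its center (cx, cy), its width and
    height, and a rotation angle in radians about the center.
    """
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    # Corner offsets from the center before any rotation is applied.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset, then translate it back to the box center.
    return [
        (cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
        for dx, dy in offsets
    ]


# Example: a 40x20 box centered at (100, 50), rotated by 30 degrees.
corners = obb_corners(100.0, 50.0, 40.0, 20.0, math.radians(30))
```

With an angle of zero the result reduces to an ordinary axis-aligned box, which is a convenient sanity check for any angle convention you adopt.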
MobileCLIP is an image embedding model developed by Apple and introduced in the paper "MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training".
Qwen-VL is a large multimodal model (LMM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as inputs, and can output text and bounding boxes. Qwen-VL natively supports English, Chinese, and multilingual conversation.
RTMDet is an efficient real-time object detector whose self-reported metrics outperform the YOLO series. It achieves 52.8% AP on COCO at 300+ FPS on an NVIDIA 3090 GPU, making it one of the fastest and most accurate object detectors available at the time of writing.