Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (11): 3698-3706. DOI: 10.11772/j.issn.1001-9081.2024111599

• Multimedia Computing and Computer Simulation •

Binocular vision object localization algorithm for robot arm grasping

Changjiang JIANG1,2, Jie XIANG1,2, Xuying HE1,2

  1. School of Automation/Industrial Internet of Things, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. Key Laboratory of Industrial Internet of Things and Networked Control, Ministry of Education (Chongqing University of Posts and Telecommunications), Chongqing 400065, China
  • Received: 2024-11-11 Revised: 2024-12-30 Accepted: 2025-01-07 Online: 2025-01-14 Published: 2025-11-10
  • Corresponding author: Changjiang JIANG
  • About authors: XIANG Jie, born in 1997 in Chongqing, M. S. candidate. His research interests include computer vision and object detection.
    HE Xuying, born in 2000 in Henan, M. S. candidate. Her research interests include computer vision and object detection.
  • Supported by:
    National Natural Science Foundation of China (62277008)

Binocular vision object localization algorithm for robot arm grasping

Changjiang JIANG1,2, Jie XIANG1,2, Xuying HE1,2

  1. School of Automation/Industrial Internet of Things, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. Key Laboratory of Industrial Internet of Things and Networked Control, Ministry of Education (Chongqing University of Posts and Telecommunications), Chongqing 400065, China
  • Received:2024-11-11 Revised:2024-12-30 Accepted:2025-01-07 Online:2025-01-14 Published:2025-11-10
  • Contact: Changjiang JIANG
  • About authors: XIANG Jie, born in 1997, M. S. candidate. His research interests include computer vision and object detection.
    HE Xuying, born in 2000, M. S. candidate. Her research interests include computer vision and object detection.
  • Supported by:
    National Natural Science Foundation of China(62277008)

Abstract:

Recognizing a target with machine vision algorithms and locating its spatial coordinates is the key to visual grasping with a robotic arm. To address the low localization accuracy and low running efficiency of binocular vision-based object recognition and localization, a network architecture named BDS-YOLO (Binocular Detect and Stereo YOLO), which couples binocular object detection with stereo depth estimation for robot arm grasping, and an object localization algorithm based on BDS-YOLO were proposed. The algorithm combines object detection with stereo depth estimation and uses an attention mechanism for cross-view feature interaction to strengthen feature representation, so that the network can obtain high-quality disparity maps through deep feature matching; after further refinement by a self-attention mechanism, the disparity maps are converted into depth information by the triangulation principle. The BDS-YOLO network adopts multi-task learning, training the object detection and stereo depth estimation branches simultaneously on both synthetic and real data. Since dense depth annotations are hard to obtain for real data, self-supervised learning is used to optimize the process of reconstructing images from disparities, which improves the generalization of the BDS-YOLO network to the real world. Experimental results show that on the real dataset the Average Precision (AP) of BDS-YOLO for object detection is 6.5 percentage points higher than that of YOLOv8l, its predicted disparities and converted depths are better than those of dedicated stereo depth estimation algorithms, its inference speed exceeds 20 frame/s, and it outperforms the comparison methods in both recognition and localization of target objects, so it can well meet the needs of real-time object detection and localization.
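For reference (the abstract does not spell it out), the triangulation principle referred to above is the standard rectified-stereo relation Z = f·B/d, where Z is the depth of a point, f the focal length of the rectified cameras, B the baseline between the two cameras, and d the predicted disparity of the matched pixel pair; closer objects therefore produce larger disparities.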

Key words: binocular vision, object detection, stereo matching, stereo depth estimation, object localization, deep learning, attention mechanism

Abstract:

Recognizing an object and locating its spatial coordinates with machine vision algorithms is crucial for visual grasping with robotic arms. To address the low localization accuracy and low running efficiency of binocular vision-based object recognition and localization, a network architecture named BDS-YOLO (Binocular Detect and Stereo YOLO), which jointly performs object detection and stereo depth estimation, and a BDS-YOLO-based object localization algorithm for robot arm grasping were proposed. The algorithm integrated object detection with stereo depth estimation, leveraging an attention mechanism for cross-view feature interaction to enhance feature representation. This enabled the network to obtain high-quality disparity maps through deep feature matching; after further refinement by a self-attention mechanism, the disparity maps were converted into depth information using the triangulation principle. The BDS-YOLO network adopted multi-task learning to train the object detection and stereo depth estimation networks jointly on both synthetic and real-world data. To overcome the difficulty of annotating dense depth for real data, self-supervised learning was applied to optimize the process of reconstructing images from disparities, improving the generalization ability of the BDS-YOLO network to real-world scenarios. Experimental results show that the BDS-YOLO network achieves an Average Precision (AP) for object detection 6.5 percentage points higher than that of YOLOv8l on the real-world dataset, outperforms specialized stereo depth estimation algorithms in disparity prediction and converted depth, reaches an inference speed of over 20 frame/s, and surpasses the comparison methods in both object recognition and localization, so it can meet the requirements of real-time object detection and localization.
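To make the self-supervised part concrete, the following is a minimal sketch, not the paper's implementation, of how a disparity-based image reconstruction loss is typically built for unlabeled real stereo pairs: the right image is warped into the left view using the predicted disparity and compared photometrically with the captured left image. PyTorch is assumed, and all function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F


def warp_right_to_left(right_img: torch.Tensor, disparity: torch.Tensor) -> torch.Tensor:
    """Reconstruct the left view by sampling the right image at x - d(x).

    right_img: (B, 3, H, W) right-camera image
    disparity: (B, 1, H, W) predicted left-view disparity in pixels
    """
    b, _, h, w = right_img.shape
    device, dtype = right_img.device, right_img.dtype
    ys, xs = torch.meshgrid(
        torch.arange(h, device=device, dtype=dtype),
        torch.arange(w, device=device, dtype=dtype),
        indexing="ij",
    )
    # shift the horizontal sampling coordinate by the predicted disparity
    xs = xs.unsqueeze(0).expand(b, -1, -1) - disparity.squeeze(1)
    ys = ys.unsqueeze(0).expand(b, -1, -1)
    # normalize pixel coordinates to [-1, 1] as required by grid_sample
    grid = torch.stack((2.0 * xs / (w - 1) - 1.0, 2.0 * ys / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(right_img, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)


def photometric_loss(left_img: torch.Tensor, right_img: torch.Tensor,
                     disparity: torch.Tensor) -> torch.Tensor:
    """L1 photometric reconstruction loss for an unlabeled real stereo pair."""
    reconstruction = warp_right_to_left(right_img, disparity)
    return (left_img - reconstruction).abs().mean()
```

Because such a loss only needs the stereo pair itself, it provides a training signal on real images without dense depth labels, which is the role the abstract assigns to self-supervision when training BDS-YOLO on real-world data.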

Key words: binocular vision, object detection, stereo matching, stereo depth estimation, object localization, deep learning, attention mechanism

CLC number: