Researchers have developed a technique that allows artificial intelligence (AI) programs to better map three-dimensional space using two-dimensional images captured by multiple cameras. Because the technique works effectively with limited computational resources, it holds promise for improving navigation in autonomous vehicles.
“Most autonomous vehicles use powerful AI programs called vision transformers to take 2D images from multiple cameras and create a representation of the 3D space around the vehicle,” says Tianfu Wu, an associate professor of electrical and computer engineering at North Carolina State University and corresponding author of a paper on the work. “However, while each of these AI programs takes a different approach, there is still significant room for improvement.
“Our technique, called Multi-View Attentive Contextualization (MvACon), is a plug-and-play supplement that can be used in conjunction with these existing vision transformer AIs to improve their ability to map 3D spaces,” Wu says. “The vision transformers aren’t getting any additional data from their cameras, they’re just able to make better use of the data.”
MvACon works, in effect, by modifying an approach called Patch-to-Cluster attention (PaCa), which Wu and his collaborators published last year. PaCa allows transformer AIs to more efficiently and effectively identify objects in an image.
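The core idea behind Patch-to-Cluster attention is that, instead of having every image patch attend to every other patch (a quadratic cost), patch tokens attend to a small number of latent clusters. The sketch below is a minimal illustration of that idea, not the authors' implementation: the dimensions, the random-projection cluster assignment, and the identity query/key/value projections are all simplifying assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (not from the paper): 196 patch tokens,
# 64-dim features, 8 latent clusters.
N, D, M = 196, 64, 8
patches = rng.standard_normal((N, D))

# 1) Softly assign the N patch tokens to M clusters. Here the assignment
#    uses a fixed random projection; in practice it would be learned.
assign = softmax(patches @ rng.standard_normal((D, M)), axis=-1)      # (N, M)
clusters = assign.T @ patches / (assign.sum(axis=0)[:, None] + 1e-9)  # (M, D)

# 2) Patch queries attend to cluster keys/values: the attention matrix
#    is N x M rather than N x N, so cost scales with O(N*M).
q, k, v = patches, clusters, clusters          # identity projections for brevity
attn = softmax(q @ k.T / np.sqrt(D), axis=-1)  # (N, M)
out = attn @ v                                 # (N, D) contextualized tokens
```

Because M is much smaller than N, the attention computation stays cheap even as image resolution (and hence N) grows, which is consistent with the article's point that the added computational demand is small.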
“The key advance here is applying what we demonstrated with PaCa to the challenge of mapping 3D space using multiple cameras,” Wu says.
To test the performance of MvACon, the researchers used it in conjunction with three leading vision transformers: BEVFormer, the DFA3D variant of BEVFormer, and PETR. In each case, the vision transformer collected 2D images from six different cameras. In all three cases, MvACon significantly improved the performance of each vision transformer.
“Performance was particularly improved when it came to locating objects, as well as the speed and orientation of those objects,” Wu says. “And the increase in computational demand of adding MvACon to the vision transformers was almost negligible.
“Our next steps include testing MvACon against additional benchmark datasets, as well as testing it against actual video input from autonomous vehicles. If MvACon continues to outperform existing vision transformers, we’re optimistic that it will be adopted for widespread use.”
The paper, “Multi-View Attentive Contextualization for Multi-View 3D Object Detection,” will be presented June 20 at the IEEE/CVF Conference on Computer Vision and Pattern Recognition in Seattle, Washington. The first author of the paper is Xianpeng Liu, a recent Ph.D. graduate of NC State. The paper was co-authored by Ce Zheng and Chen Chen of the University of Central Florida; Ming Qian and Nan Xue of Ant Group; and Zhebin Zhang and Chen Li of the OPPO U.S. Research Center.
This work was done with support from the National Science Foundation under grants 1909644, 2024688, and 2013451; the U.S. Army Research Office under grants W911NF1810295 and W911NF2210010; and a research gift fund from Innopeak Technology, Inc.