3D Traffic Vulnerable Group Detection in Simulation-driven Autonomous Driving
Introduction
Autonomous driving is one of the most exciting topics in the fields of machine learning and deep learning. In recent years, the technology behind autonomous driving has advanced rapidly in both academia and industry. The various aspects of autonomous driving can be divided into three main modules: perception, prediction, and decision-making. Due to the data-driven nature of deep learning models, effective algorithms require high-quality data sets. If these high standards are not met, the desired outcomes are unlikely to be achieved. Currently, it is acknowledged that in the decision-making module, training on 1 million kilometers of data can lead to better results. However, no similar benchmark exists for the perception module.
Previous 3D Lidar detection algorithms often overlooked vulnerable traffic groups, such as pedestrians and cyclists. We aim to propose a straightforward strategy or framework that maximizes the utility of unit data frames within enhanced datasets, thereby improving the algorithm’s detection of vulnerable traffic groups.
Unfortunately, Lidar sensors are expensive and beyond our project team's budget. Furthermore, collecting and annotating Lidar data poses significant challenges. Compared to traditional data or image data for autonomous driving, Lidar data is often “too sparse, abstract in description, and difficult to visualize.” This is especially true for vulnerable traffic groups, which are particularly hard to label in real data.
To address these issues, our project employs a simulation environment for data collection and preparation. We have developed a set of tools to automatically collect and annotate data within this simulation environment, which allows us to directly extract the 3D positions of targets for annotation. Additionally, given the low frequency of vulnerable groups in existing datasets, we constructed simple scenarios in high dimensions to increase the sample size of these groups.
In summary, to tackle the challenges posed by the small and unclear samples of vulnerable traffic groups (pedestrians and cyclists) in traditional datasets, our project has built custom collection tools based on a simulation environment. We also proposed a strategic framework to enhance data collection. To verify its effectiveness, we tested our custom dataset across multiple algorithms, observing a significant improvement in the detection of vulnerable traffic groups.
Related Work
CARLA is a powerful open-source simulator designed for autonomous driving research. It can create a virtual urban environment and simulate various sensors, including cameras, LiDAR, and mmWave radar, to provide essential data. Many researchers have developed their self-driving systems within the CARLA environment. By applying established object detection algorithms such as YOLO and Faster R-CNN to the data generated by CARLA, they can implement object detection in their systems. The same applies to object tracking, where algorithms like GOTURN and Deep SORT can be employed.
For this project, we will use the open-source 3D point cloud detection algorithm training framework, OpenPCDet. This framework is currently a popular and lightweight option for point cloud algorithm training and supports a variety of network architectures.
| Model | Car@R11 | Pedestrian@R11 | Cyclist@R11 | Dataset |
|---|---|---|---|---|
| PointPillar | 77.28 | 52.29 | 62.68 | KITTI |
| SECOND | 78.62 | 52.98 | 67.15 | KITTI |
| Voxel R-CNN | 84.54 | - | - | KITTI |
| BEVFusion | 67.75 | - | - | nuScenes |
| CenterPoint | 78.08 | 49.74 | 67.22 | ONCE |
| Voxel NeXt | 30.05 | - | - | Argoverse2 |
Among these, we chose the PointPillar model for experiments to verify that our method improves the detection of vulnerable traffic groups. Voxel R-CNN was selected for experiments to verify that conventional vehicle detection also improves.
Methodology
Co-simulation
This section primarily focuses on creating the simulation scene. In the CARLA map, there are numerous static vehicles that are integral to the map design and are not generated through program control. Consequently, their 3D positions do not appear in memory and cannot be annotated. During the training and verification phases, the model may detect these static vehicles but might misclassify them due to the lack of annotations.
Therefore, we first work in a source build of CARLA and use the Unreal Engine 4 editor to remove the static vehicles. After that, we use the Carla-Apollo Bridge to let Apollo take over CARLA's dynamic scene settings and perform visual operations in Dreamview.
Problem Definition
We are given a 3D point cloud \(\mathbf{P}=\{p_1,p_2,...,p_n\}\), the set of measured points \(p_i=(x_i,y_i,z_i)\), which presents a snapshot of the surroundings. The set \(\mathbf{O}=\{o_1,o_2,...,o_m\}\) contains all objects in the point cloud (vehicles, traffic lights, pedestrians, etc.) in KITTI format, where \(o_j=(x_j,y_j,z_j,w_j,h_j,l_j,r_j)\) gives the position, width, height, length, and rotation of object \(j\).
We use \(S_{keep}\) to decide whether to keep the Lidar data,
\[S_{keep} \equiv \mathcal{E} (P) \cdot \mathcal{P}(\rho(O),\tau(O))\]where \(\mathcal{E}\) decides whether to keep the entire point cloud, and \(\mathcal{P}\) decides whether to keep the collected target data.
We use \(S_{value}\) to represent the value of the current Lidar data for training the model,
\[S_{value}\equiv \phi(P, O) \cdot \psi(O)\]where \(\phi(P, O)\) is the perception distance term, which describes the relationship between the effective perception radius and the farthest perception radius, and \(\psi(O)\) is the prediction accuracy term, which describes how accurately objects within the perception radius are predicted.
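The value score can be illustrated with a minimal sketch. The text does not give closed forms for \(\phi\) and \(\psi\), so everything concrete here is an assumption made for illustration: \(\phi\) is taken as the ratio of the distance to the farthest detected object to a hypothetical sensor range `max_range`, and \(\psi\) as the fraction of objects inside that effective radius that were detected. All function names are hypothetical.

```python
import math

def perception_distance_term(object_xy, detected, max_range):
    """phi (assumed form): effective perception radius / farthest perception radius."""
    dists = [math.hypot(x, y) for (x, y), hit in zip(object_xy, detected) if hit]
    effective_radius = max(dists, default=0.0)
    return min(effective_radius / max_range, 1.0)

def prediction_accuracy_term(object_xy, detected, radius):
    """psi (assumed form): fraction of objects within `radius` that were detected."""
    inside = [hit for (x, y), hit in zip(object_xy, detected)
              if math.hypot(x, y) <= radius]
    return sum(inside) / len(inside) if inside else 0.0

def frame_value(object_xy, detected, max_range):
    """S_value = phi * psi for one frame, under the assumptions above."""
    phi = perception_distance_term(object_xy, detected, max_range)
    psi = prediction_accuracy_term(object_xy, detected, radius=phi * max_range)
    return phi * psi

# Four objects at distances 5, 10, 20 and 50 m; the 1st and 3rd are detected.
objects = [(3.0, 4.0), (6.0, 8.0), (0.0, 20.0), (30.0, 40.0)]
hits = [True, False, True, False]
value = frame_value(objects, hits, max_range=40.0)  # phi = 0.5, psi = 2/3
```

Under these assumed definitions, a frame scores higher when detections reach farther and when fewer objects inside the effective radius are missed.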
Pedestrian Control Algorithm
The pedestrian control algorithm in the co-simulation governs pedestrian behavior in the interval between two timestamps. Here \(pos\) is the pedestrian's position, \(range\) is the perception range (only the 160 degrees in front of the eyes), \(towards\) is the absolute orientation angle, and \(speed\) is the pedestrian's speed.
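As a rough illustration of how these four quantities could interact, the sketch below checks whether a target falls inside the 160-degree perception range and advances a pedestrian between two timestamps. It is not the authors' algorithm: it assumes planar coordinates, angles in degrees measured counterclockwise from the x-axis, and straight-line motion at constant speed. The function names are hypothetical.

```python
import math

def in_perception_range(pos, towards_deg, target, range_deg=160.0):
    """True if `target` lies within the forward perception cone of the pedestrian."""
    bearing = math.degrees(math.atan2(target[1] - pos[1], target[0] - pos[0]))
    # Signed difference between bearing and heading, wrapped into [-180, 180).
    diff = (bearing - towards_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= range_deg / 2.0

def advance(pos, towards_deg, speed, dt):
    """Move the pedestrian along its heading for `dt` seconds."""
    rad = math.radians(towards_deg)
    return (pos[0] + speed * dt * math.cos(rad),
            pos[1] + speed * dt * math.sin(rad))

ahead = in_perception_range((0.0, 0.0), 0.0, (1.0, 1.0))     # 45 degrees off heading
behind = in_perception_range((0.0, 0.0), 0.0, (-1.0, 0.0))   # directly behind
new_pos = advance((0.0, 0.0), 90.0, speed=1.5, dt=2.0)       # walks 3 m along +y
```

A controller built on this could, for example, change `towards` or `speed` only when a vehicle enters the perception cone, which is one plausible use of the limited field of view.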
Feature Upgrade
The full pipeline of our improved VFE processing stage is as follows.
- De-mean encoding: subtract the mean of the points in each Pillar from every point, \((P, M, 4)\xrightarrow{}(P, M, 3)\)
- Decentralized encoding: offset the valid points in each Pillar from the Pillar center, \((P, M, 4)\xrightarrow{} (P, M, 2)\)
- Mask merge encoding: concatenate the original \((P, M, 4)\) with the two encodings above to get a vector of \((P, M, 9)\). There are two points to note here:
  - Only valid points (\(n\) points per Pillar) are processed in each Pillar. If there are too few valid points, zeros are padded; if there are too many, points are randomly sampled;
  - In the code, the first 2 dimensions of the 9-dimensional encoding vector are replaced by the decentralized encoding.
- Convolution and pooling: \((P, M, 9)\xrightarrow{}(P, M, 64)\), then pooling over the points in each Pillar to get \((P, 64)\)
- Pillar scatter: scatter the Pillar features back onto the 2D grid of size \((X/vsize, Y/vsize)\) to get a feature map of \((64, X/vsize, Y/vsize)\).
The PointPillar model differs from voxel-based methods in that it describes point clouds with Pillars, which disregard certain information along the Z-axis. Since the model does not use this information, the VFE encoding process can omit it as well. This speeds up encoding and reduces both training and inference times.
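The shape changes in the VFE steps above can be traced with a small NumPy sketch. It is an illustration under stated assumptions rather than the project's code: the learned layer is replaced by a random weight matrix, pooling is taken to be a maximum over the points of each Pillar, the replacement of the first two dimensions is not reproduced, and the Pillar scatter step is left out. The function name `encode_pillars` and its arguments are hypothetical.

```python
import numpy as np

def encode_pillars(pillars, num_points, centers_xy, weight):
    """pillars: (P, M, 4) zero-padded points (x, y, z, intensity).
    num_points: (P,) number of valid points per Pillar.
    centers_xy: (P, 2) Pillar centers.
    weight: (9, C) stand-in for the learned layer."""
    P, M, _ = pillars.shape
    # Mask that is True for valid points and False for zero padding: (P, M, 1).
    mask = (np.arange(M)[None, :] < num_points[:, None])[..., None]
    xyz = pillars[..., :3]
    counts = np.maximum(num_points, 1)[:, None, None]
    mean = (xyz * mask).sum(axis=1, keepdims=True) / counts
    demeaned = xyz - mean                                # (P, M, 3)
    offset = pillars[..., :2] - centers_xy[:, None, :]   # (P, M, 2)
    # Concatenate and zero out the padded rows: (P, M, 9).
    feats = np.concatenate([pillars, demeaned, offset], axis=-1) * mask
    hidden = np.maximum(feats @ weight, 0.0)             # (P, M, C), linear + ReLU
    return feats, hidden.max(axis=1)                     # pooled: (P, C)

rng = np.random.default_rng(0)
pillars = np.zeros((2, 3, 4))
pillars[0, :2] = [[1.0, 2.0, 0.5, 0.1], [3.0, 4.0, 1.5, 0.3]]
pillars[1, :1] = [[5.0, 5.0, 1.0, 0.9]]
num_points = np.array([2, 1])
centers_xy = np.array([[2.0, 3.0], [5.0, 5.0]])
feats, pooled = encode_pillars(pillars, num_points, centers_xy,
                               rng.normal(size=(9, 64)))
print(feats.shape, pooled.shape)  # (2, 3, 9) (2, 64)
```

Because padded rows are zeroed before the layer and ReLU outputs are non-negative, the padding never wins the maximum over a Pillar's valid points.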
Experiment
Three datasets were collected: A did not use the scenes we built, while B and D did. Each dataset consists of 5 subsets, and the numbers of “vehicles (including cyclists) and pedestrians” in the subsets are \((50,25)\), \((75,37)\), \((100,50)\), \((125,62)\) and \((150,75)\).
| Dataset name | Total Frames | Map | Detail |
|---|---|---|---|
| A | 987 | Town05 | City |
| B | 902 | Town02 | Town |
| D | 988 | Town06 | Highway |
| V | 375 | - | Randomly selected from A, B, and D |
We then train PointPillar and Voxel R-CNN on each of datasets A, B, and D, using epoch = 160, batch size = 18, a dynamically adjusted learning rate, and random seed = 114. This results in a total of 6 models. All training is performed on a server with the following specifications.
- CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
- GPU: NVIDIA TITAN V \(\times \ 6\)
- OS: Ubuntu 22.04.2 LTS
- MEM: 453 G
Results
The loss curve exported from TensorBoard shows the loss decreasing as the step increases.
The following are the evaluation results of PointPillar after training in OpenPCDet. The metric is mAP70, that is, mean average precision where a detection counts as correct when its IoU with the ground-truth box is at least 0.7.
| Dataset | Car | Truck | Van | Pedestrian | Cyclist |
|---|---|---|---|---|---|
| A | 53.86 | 68.32 | 52.78 | 38.08 | 45.11 |
| B | 60.28 | 71.27 | 54.11 | 40.87 | 56.30 |
| D | 64.09 | 68.59 | 71.09 | 48.05 | 52.26 |
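To illustrate the matching criterion behind such scores, the sketch below assumes that mAP70 refers to an IoU threshold of 0.7 and uses axis-aligned bird's-eye-view boxes for brevity; KITTI-style evaluation works with rotated boxes, so this is a simplification. The function names and the box layout are hypothetical.

```python
def bev_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    overlap_x = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_y = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = overlap_x * overlap_y
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(pred, gt, threshold=0.7):
    """A prediction counts as correct when its IoU with the ground truth meets the threshold."""
    return bev_iou(pred, gt) >= threshold

gt_box = (0.0, 0.0, 4.0, 2.0)
close_pred = (0.2, 0.0, 4.2, 2.0)   # shifted by 0.2 m: IoU is about 0.90
far_pred = (1.0, 0.0, 5.0, 2.0)     # shifted by 1.0 m: IoU is 0.60
results = (is_correct_detection(close_pred, gt_box),
           is_correct_detection(far_pred, gt_box))
```

For small objects such as pedestrians, a shift of a few tens of centimeters is enough to fall below the threshold, which is part of why their scores tend to be lower than those of vehicles.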
Similarly, the results for Voxel R-CNN are shown below.
| Dataset | Vehicle |
|---|---|
| A | 8.63 |
| B | 64.28 |
| D | 63.21 |
Visualization of model predictions on samples from the test set.
The PointPillar model accurately detects most small, distant objects. Voxel R-CNN correctly and completely detects all vehicles in this scene, even those that are severely occluded.
Conclusions
In this project, we address the current oversight of vulnerable traffic groups, such as pedestrians and cyclists, in 3D detection algorithms. We propose a project framework built on the CARLA simulation environment, which encompasses scene construction, data preparation, and model training. We conducted experiments using the PointPillar algorithm and the Voxel R-CNN algorithm within the OpenPCDet framework. The experimental results indicate that our approach is effective: it improves 3D detection of vulnerable traffic groups while also improving performance in general vehicle detection tasks.