Introduction

Autonomous driving is one of the most exciting topics in the fields of machine learning and deep learning. In recent years, the technology behind autonomous driving has advanced rapidly in both academia and industry. The various aspects of autonomous driving can be divided into three main modules: perception, prediction, and decision-making. Due to the data-driven nature of deep learning models, effective algorithms require high-quality data sets. If these high standards are not met, the desired outcomes are unlikely to be achieved. Currently, it is acknowledged that in the decision-making module, training on 1 million kilometers of data can lead to better results. However, no similar benchmark exists for the perception module.

Previous 3D Lidar detection algorithms often overlooked vulnerable traffic groups, such as pedestrians and cyclists. We aim to propose a straightforward strategy or framework that maximizes the utility of unit data frames within enhanced datasets, thereby improving the algorithm’s detection of vulnerable traffic groups.

The simple main pipeline of this project

Unfortunately, Lidar sensors are expensive, making them unaffordable for our project team. Furthermore, collecting and annotating Lidar data poses significant challenges. Compared to traditional data or image data for autonomous driving, Lidar data is often “too sparse, abstract in description, and difficult to visualize.” This is especially true for identifying traffic-disadvantaged groups, as labeling them based on real data is particularly challenging.

To address these issues, our project employs a simulation environment for data collection and preparation. We have developed a set of tools to automatically collect and annotate data within this simulation environment, which allows us to directly extract the 3D positions of targets for annotation. Additionally, given the low frequency of vulnerable groups in existing datasets, we constructed simple scenarios in high dimensions to increase the sample size of these groups.

In summary, to tackle the challenges posed by the small and unclear samples of vulnerable traffic groups (pedestrians and cyclists) in traditional datasets, our project has built custom collection tools based on a simulation environment. We also proposed a strategic framework to enhance data collection. To verify its effectiveness, we tested our custom dataset across multiple algorithms, observing a significant improvement in the detection of vulnerable traffic groups.

CARLA is a powerful open-source simulator designed for autonomous driving research. It can create a virtual urban environment and simulate various sensors, including cameras, LiDAR, and mmWave radar, to provide essential data. Many researchers have developed their self-driving systems within the Carla environment. By utilizing established object detection algorithms such as YOLO and Faster R-CNN to process the data generated by Carla, they can effectively implement object detection in their systems. The same applies to object tracking, where algorithms like GOTURN and Deep SORT can be employed to achieve successful tracking.

For this project, we will use the open-source 3D point cloud detection algorithm training framework, OpenPCDet. This framework is currently a popular and lightweight option for point cloud algorithm training and supports a variety of network architectures.

Model	Car@R11	Pedestrian@R11	Cyclist@R11	Dataset
PointPillar	77.28	52.29	62.68	KITTI
SECOND	78.62	52.98	67.15	KITTI
Voxel R-CNN	84.54	-	-	KITTI
BEVFusion	67.75	-	-	nuScenes
CenterPoint	78.08	49.74	67.22	ONCE
Voxel NeXt	30.05	-	-	Argoverse2

Among these, we chose the PointPillar model to verify that our method improves the detection of vulnerable traffic groups, and Voxel R-CNN to verify that conventional vehicle detection also improves.

Methodology

Co-simulation

This section focuses on creating the simulation scene. The Carla maps contain numerous static vehicles that are part of the map design rather than being generated through program control. Consequently, their 3D positions are not available in memory and they cannot be annotated. During the training and verification phases, the model may detect these static vehicles, but because they have no annotations, such detections would be treated as mistakes.

Operating the source-code version of Carla in Unreal Engine 4

Therefore, we first work in the source-code version of Carla and use the Unreal Engine 4 toolkit to remove the static vehicles. We then use the Carla-Apollo Bridge to let Apollo take over Carla's dynamic scene settings and perform visual operations in Dreamview.

Problem Definition

Let $\mathbf{P}=\{p_1,p_2,...,p_n\}$ be a 3D point cloud, that is, the set of measured points $p_i=(x_i,y_i,z_i)$ that forms a snapshot of the surroundings. Let $\mathbf{O}=\{o_1,o_2,...,o_m\}$ be the set of all objects in the point cloud (vehicles, traffic lights, pedestrians, etc.), annotated in KITTI format, where $o_j=(x_j,y_j,z_j,w_j,h_j,l_j,r_j)$ gives the box center, its width, height, and length, and its rotation angle.

We use $S_{keep}$ to decide whether a frame of Lidar data is kept.

$S_{keep} \equiv \mathcal{E} (P) \cdot \mathcal{P}(\rho(O),\tau(O))$ where $\mathcal{E}$ decides whether to keep the entire point cloud, and $\mathcal{P}$ decides whether to keep the collected target data.
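The keep score can be illustrated with a small sketch. The text does not define $\rho$, $\tau$, or the exact rules inside $\mathcal{E}$ and $\mathcal{P}$, so everything below is an assumption made for illustration only: $\mathcal{E}$ is taken to be a minimum point count, $\rho(O)$ the share of pedestrians and cyclists among the annotated objects, $\tau(O)$ the number of annotated objects, and all thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float, float]        # p_i = (x_i, y_i, z_i)
VULNERABLE = {"Pedestrian", "Cyclist"}    # classes this project focuses on


@dataclass
class Obj:
    """One annotated object o_j = (x, y, z, w, h, l, r) plus its class label."""
    label: str
    x: float
    y: float
    z: float
    w: float
    h: float
    l: float
    r: float


def keep_cloud(points: Sequence[Point], min_points: int = 1000) -> int:
    """E(P), assumed rule: 1 if the frame has at least min_points returns."""
    return int(len(points) >= min_points)


def rho(objects: Sequence[Obj]) -> float:
    """Assumed rho(O): fraction of objects that are pedestrians or cyclists."""
    if not objects:
        return 0.0
    return sum(o.label in VULNERABLE for o in objects) / len(objects)


def tau(objects: Sequence[Obj]) -> int:
    """Assumed tau(O): number of annotated objects in the frame."""
    return len(objects)


def keep_targets(ratio: float, count: int,
                 min_ratio: float = 0.2, min_count: int = 1) -> int:
    """P(rho, tau), assumed rule: 1 if the frame has enough relevant targets."""
    return int(count >= min_count and ratio >= min_ratio)


def s_keep(points: Sequence[Point], objects: Sequence[Obj]) -> int:
    """S_keep = E(P) * P(rho(O), tau(O)); the frame is kept when this is 1."""
    return keep_cloud(points) * keep_targets(rho(objects), tau(objects))
```

Because both factors are indicators under these assumptions, the product acts as a logical AND: a frame is kept only if the point cloud itself is usable and it contains enough targets of interest.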

We use $S_{value}$ to represent the value of the current Lidar data for training the model,

\[S_{value}\equiv \phi(P, O) \cdot \psi(O)\]

where $\phi(P, O)$ is the perception distance term, which describes the relationship between the effective perception radius and the farthest perception radius, and $\psi(O)$ is the prediction accuracy term, which describes how accurately objects within the perception radius are predicted.
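As a rough illustration of how the two terms might be computed, the sketch below makes several assumptions that are not stated in the text: the effective perception radius is taken as the planar distance to the farthest annotated object, the farthest perception radius as the planar distance to the farthest Lidar return, $\phi$ as their ratio, and $\psi$ as the fraction of objects that a detector predicted correctly. The function names and the `detected` flags are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]   # (x, y, z) in the sensor frame


def phi(points: Sequence[Point], object_centers: Sequence[Point]) -> float:
    """Assumed phi(P, O): effective radius / farthest radius, clipped to 1."""
    if not points or not object_centers:
        return 0.0
    r_max = max(math.hypot(p[0], p[1]) for p in points)
    if r_max == 0.0:
        return 0.0
    r_eff = max(math.hypot(c[0], c[1]) for c in object_centers)
    return min(r_eff / r_max, 1.0)


def psi(detected: Sequence[bool]) -> float:
    """Assumed psi(O): share of objects in range that were predicted correctly."""
    if not detected:
        return 0.0
    return sum(detected) / len(detected)


def s_value(points: Sequence[Point],
            object_centers: Sequence[Point],
            detected: Sequence[bool]) -> float:
    """S_value = phi(P, O) * psi(O)."""
    return phi(points, object_centers) * psi(detected)
```

Under these assumptions a frame scores highly only when objects are annotated far out relative to the sensor's reach and the model predicts them reliably.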

Pedestrian Control Algorithm

Pedestrians in the co-simulation stage are controlled by the following algorithm, which governs pedestrian behavior during the interval between two timestamps. Here $pos$ is the pedestrian's position, $range$ is the perception range (only the 160 degrees in front of the eyes), $towards$ is the absolute orientation angle, and $speed$ is the pedestrian's speed.

Simple pedestrian control pseudocode
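One update step of such a controller could look like the sketch below. Only the four quantities ($pos$, $range$, $towards$, $speed$) and the 160-degree field of view come from the text; the avoidance rule (turn by a fixed angle when an obstacle is in view), the view distance, and all function names are assumptions made for illustration and are not the project's actual algorithm.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Tuple

FOV_DEG = 160.0   # perception range in front of the pedestrian (from the text)


@dataclass
class Pedestrian:
    pos: Tuple[float, float]   # planar position in metres
    towards: float             # absolute heading in degrees
    speed: float               # metres per second


def in_view(ped: Pedestrian, target: Tuple[float, float],
            view_dist: float) -> bool:
    """True if target is within view_dist and inside the 160-degree cone."""
    dx, dy = target[0] - ped.pos[0], target[1] - ped.pos[1]
    if math.hypot(dx, dy) > view_dist:
        return False
    bearing = math.degrees(math.atan2(dy, dx))
    # Wrap the angular difference into [-180, 180) before comparing.
    diff = (bearing - ped.towards + 180.0) % 360.0 - 180.0
    return abs(diff) <= FOV_DEG / 2.0


def step(ped: Pedestrian, obstacles: Iterable[Tuple[float, float]], dt: float,
         view_dist: float = 5.0, turn_deg: float = 30.0) -> Pedestrian:
    """Advance the pedestrian by dt seconds between two timestamps."""
    towards = ped.towards
    if any(in_view(ped, o, view_dist) for o in obstacles):
        towards = (towards + turn_deg) % 360.0   # assumed avoidance rule
    rad = math.radians(towards)
    pos = (ped.pos[0] + ped.speed * dt * math.cos(rad),
           ped.pos[1] + ped.speed * dt * math.sin(rad))
    return Pedestrian(pos, towards, ped.speed)
```

Obstacles behind the pedestrian fall outside the cone and are ignored, which is the main behavioral consequence of limiting perception to 160 degrees.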

Feature Upgrade

The improved voxel feature encoding (VFE) stage proceeds as follows.

De-mean encoding: for each point in a pillar, subtract the mean of the points in that pillar, $(P, M, 4)\xrightarrow{}(P, M, 3)$.
Decentralized encoding: offset the valid points in each pillar from the pillar center, $(P, M, 4)\xrightarrow{} (P, M, 2)$.
Mask merge encoding: concatenate the original $(P, M, 4)$ with the two encodings above to obtain a tensor of shape $(P, M, 9)$. There are two points to note here:
- Only valid points ($n$ points per pillar) are processed in each pillar. If there are too few valid points, zeros are padded; if there are too many, points are randomly sampled;
- In the code, the first 2 dimensions of the 9-dimensional encoding vector are replaced by the decentralized encoding.
Convolution and pooling: $(P, M, 9)\xrightarrow{}(P, M, 64)$, followed by pooling over the points of each pillar to obtain $(P, 64)$.
Pillar scatter: scatter the pillar features onto the 2D grid of size $(X/vsize, Y/vsize)$ to obtain a feature map of shape $(64, X/vsize, Y/vsize)$.
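The first three steps can be sketched for a single pillar in plain Python. This is a simplified illustration rather than the project's code: it covers only the feature augmentation up to the 9-dimensional vector (no learned layer, pooling, or scatter), it does not apply the replacement of the first 2 dimensions noted above, and the names `encode_pillar`, `center_xy`, and `max_points` are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

PointXYZR = Tuple[float, float, float, float]   # (x, y, z, reflectance)


def encode_pillar(points: Sequence[PointXYZR],
                  center_xy: Tuple[float, float],
                  max_points: int,
                  rng: Optional[random.Random] = None) -> List[List[float]]:
    """Return a (max_points, 9) feature list for one pillar.

    Columns: x, y, z, r | offsets from the point mean (3) |
    x/y offsets from the pillar center (2).
    """
    rng = rng or random.Random(0)
    pts = list(points)
    if len(pts) > max_points:
        # Too many points: keep a random sample of max_points.
        pts = rng.sample(pts, max_points)
    n = len(pts)
    rows: List[List[float]] = []
    if n > 0:
        mean = [sum(p[i] for p in pts) / n for i in range(3)]
        for x, y, z, r in pts:
            rows.append([
                x, y, z, r,                                # original 4 dims
                x - mean[0], y - mean[1], z - mean[2],     # de-mean encoding
                x - center_xy[0], y - center_xy[1],        # decentralized encoding
            ])
    # Too few points: zero-pad so every pillar has max_points rows.
    rows.extend([0.0] * 9 for _ in range(max_points - n))
    return rows
```

Note that the decentralized encoding carries no Z component, which is where the 2 (rather than 3) extra dimensions come from.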

Unlike voxel-based methods, the PointPillar model describes point clouds with pillars, which disregard certain information along the Z-axis. Because the model does not take this information into account, the VFE stage can omit it during encoding as well. This speeds up encoding and reduces both training and inference time.

Experiment

Three datasets were collected: A did not use the scenes we built, whereas B and D did. Each dataset consists of 5 subsets, in which the numbers of "vehicles (including cyclists) and pedestrians" are $(50,25)$, $(75,37)$, $(100,50)$, $(125,62)$ and $(150,75)$.

Dataset name	Total Frames	Map	Detail
A	987	Town05	City
B	902	Town02	Town
D	988	Town06	Highway
V	375	-	Randomly selected from A, B, and D
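The five subset sizes given for each dataset follow a regular pattern: the vehicle count grows in steps of 25, and the pedestrian count matches half the vehicle count rounded down. This is an observation from the listed numbers, not a rule stated by the project, and the helper name `subset_sizes` is hypothetical. A short sketch that reproduces the list:

```python
from typing import List, Tuple


def subset_sizes(start: int = 50, stop: int = 150,
                 step: int = 25) -> List[Tuple[int, int]]:
    """Return (vehicles, pedestrians) pairs, with pedestrians = vehicles // 2."""
    return [(vehicles, vehicles // 2)
            for vehicles in range(start, stop + 1, step)]


for vehicles, pedestrians in subset_sizes():
    print(f"spawn {vehicles} vehicles and {pedestrians} pedestrians")
```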

We then train PointPillar and Voxel R-CNN on each of the datasets A, B, and D, which yields 6 models in total. Training uses 160 epochs, a batch size of 18, a dynamically adjusted learning rate, and a random seed of 114. All models are trained on a server with the following specifications.

CPU: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
GPU: NVIDIA TITAN V $\times \ 6$
OS: Ubuntu 22.04.2 LTS
MEM: 453 G

Results

The curves below, exported from TensorBoard, show the loss decreasing as the number of training steps increases.

Training loss curves of PointPillar (left) and Voxel R-CNN (right).

The following are the evaluation results of PointPillar after training in OpenPCDet. The metric is mAP70, that is, mean average precision in which a predicted box counts as a correct detection when its IoU with the ground-truth box is at least 0.70.
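To make the matching criterion concrete, the sketch below assumes that the 70 in mAP70 refers to an IoU threshold of 0.70, as is usual in KITTI-style evaluation. It is a deliberately simplified illustration: it uses axis-aligned bird's-eye-view rectangles, whereas the real evaluation uses rotated 3D boxes, and the function names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def bev_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0.0 or inter_h <= 0.0:
        return 0.0                         # boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def is_correct(pred: Box, gt: Box, threshold: float = 0.70) -> bool:
    """A prediction counts as a true positive if IoU >= threshold."""
    return bev_iou(pred, gt) >= threshold
```

Average precision is then computed per class from the true and false positives obtained this way, ranked by detection confidence.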

Dataset	Car	Truck	Van	Pedestrian	Cyclist
A	53.86	68.32	52.78	38.08	45.11
B	60.28	71.27	54.11	40.87	56.30
D	64.09	68.59	71.09	48.05	52.26

Similarly, the results of Voxel R-CNN are shown below.

Dataset	Vehicle
A	8.63
B	64.28
D	63.21

Some detection results of the models on the test set are visualized below.

Visualization of the models trained on different datasets for the numbered data frames shown

The PointPillar model detects small, distant objects with reasonable accuracy. Voxel R-CNN correctly and completely detects all vehicles in this scene, even those that are severely occluded.

Conclusions

In this project, we address the current oversight of vulnerable traffic groups, such as pedestrians and cyclists, in 3D detection algorithms. We propose a project framework built on the Carla simulation environment that covers scene construction, data preparation, and model training. We conducted experiments using the PointPillar and Voxel R-CNN algorithms within the OpenPCDet framework. The experimental results indicate that training on data from the scenes we constructed significantly enhances 3D detection of vulnerable traffic groups while also improving general vehicle detection.