Introducing Nemo’s Offboard Perception networks — designed to convert your big data to smart data
by Gaurav Singh @ Perception, Nemo
Data collected by today’s ADAS and Autonomous Vehicles contains many examples of common object types, like cars driving near the ego vehicle, but far fewer examples of interesting scenarios like jaywalking pedestrians, trucks dangerously cutting into the ego lane, or edge cases like a rare snowy-day drive in Houston. From a machine learning training and system testing point of view, these edge cases and key scenarios are the most important, and hence not all data is created equal. This falls in line with Andrew Ng’s point that massive data sets aren’t essential for A.I. innovation. Converting big data into smaller, relevant datasets is what makes AI systems better.
Nemo’s offboard perception systems, developed specifically to sift through vast amounts of fleet-collected data, help automotive teams accurately tag their raw sensor data and identify the key scenarios and edge cases through an automated data pipeline.
These offboard perception networks run in the cloud and complement the in-vehicle, low-compute-optimized perception systems. The networks detect, classify and track all objects present in the scene and feed the tracks into high-level scenario analyzers such as a cut-in detector. An additional key requirement for these offboard networks is to detect edge cases that affect sensor and system outputs, such as extreme weather or heavy sun glare, which provide additional context to the collected data.
This first of many blogs from our perception team focuses on our camera-only approaches for object track generation and introduces our (patent-pending) deep learning network: a weakly supervised monocular camera perception network that detects and tracks road objects. The resulting perception network is lightweight and fast, and it reduces annotation costs considerably while delivering superior performance.
2D Bounding Box Approach and its limitations for camera-only object track extraction
Cameras are the most ubiquitous sensors in today’s vehicles. From after-market dashcams to camera-based Autopilot systems, the volume of data from monocular cameras is growing at an exponential rate in the automotive industry. Hence, camera-only depth estimation networks form an important class of networks we are developing in our offboard perception systems.
The simplest approach one could employ for object depth estimation in a monocular camera setting is to detect 2D bounding boxes on videos using deep learning networks, followed by association-based object tracking. In such a method, one can use the bounding box coordinates and camera calibration parameters, such as the camera intrinsic matrix, distortion parameters, pitch angle and camera height, to extract the relative lateral and longitudinal distance to the objects in the scene. Figure 1 shows the steps to extract 3D object tracks and Figure 2 shows the detected object tracks using this approach.
The relative object tracks thus calculated can be converted into absolute object tracks using absolute position information of the ego vehicle, such as GNSS.
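As a concrete illustration, the sketch below shows how the bottom-center pixel of a detected 2D box can be projected onto a flat ground plane using the camera intrinsics, pitch and mounting height. This is a minimal numpy sketch with hypothetical calibration values, assuming an undistorted image and a flat road; it is not our production code.

```python
import numpy as np

def bbox_bottom_to_ground(u, v, K, pitch_rad, cam_height_m):
    """Project the bottom-center pixel (u, v) of a 2D bounding box onto a
    flat ground plane to estimate lateral (x) and longitudinal (z) distance.

    Assumes an undistorted image, a camera mounted cam_height_m above a flat
    road, and a known pitch angle (all values here are illustrative).
    """
    # Back-project the pixel into a viewing ray in camera coordinates.
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])

    # Rotate the ray by the camera pitch so the ground plane is y = cam_height
    # (camera convention: x right, y down, z forward).
    c, s = np.cos(pitch_rad), np.sin(pitch_rad)
    R_pitch = np.array([[1, 0, 0],
                        [0, c, -s],
                        [0, s,  c]])
    ray = R_pitch @ ray_cam

    # Intersect the ray with the ground plane below the camera.
    scale = cam_height_m / ray[1]
    ground_point = scale * ray
    lateral, longitudinal = ground_point[0], ground_point[2]
    return lateral, longitudinal

# Example with hypothetical KITTI-like calibration values:
K = np.array([[721.5,   0.0, 609.6],
              [  0.0, 721.5, 172.9],
              [  0.0,   0.0,   1.0]])
print(bbox_bottom_to_ground(u=700.0, v=250.0, K=K,
                            pitch_rad=0.0, cam_height_m=1.65))
```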
This method benefits from well-researched 2D bounding box detection networks, which are lightweight and fast. However, it comes with significant limitations that outweigh the benefits:
- The object track distance accuracy is highly dependent on the accuracy of the 2D bounding box detection, since we use the points where the boxes touch the road surface to calculate the longitudinal and lateral distance to the objects.
- Distance measurements of objects rely on a flat-world assumption, which does not hold in cases like ramps and bridges.
- The method is not suitable for calculating distance to objects not connected to the road surface, like traffic signs, traffic lights, etc.
Pseudo Lidar approaches — a step above 2D bounding box methods
Using a Pseudo Lidar approach and combining it with bounding box detection is another method one could explore for extracting 3D object tracks. Pseudo Lidar approaches are a hot topic of research in the computer vision field and refer to generating a dense pseudo point cloud using monocular camera images alone as input at test time. Figures 3 and 4 show the depth map outputs of popular Pseudo Lidar methods; each pixel in the image is labeled with an inferred depth value by a deep learning network.
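For intuition, the following minimal numpy sketch shows the standard pinhole back-projection that turns a dense depth map into such a pseudo point cloud. It assumes an undistorted camera with intrinsics K and is illustrative only; real pipelines typically also mask invalid or far-away depth values.

```python
import numpy as np

def depth_map_to_pseudo_point_cloud(depth, K):
    """Back-project a dense depth map (H x W, meters) into a pseudo point
    cloud of shape (H*W, 3) in camera coordinates."""
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pixel grid covering the whole image.
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Standard pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```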
Pseudo Lidar based approaches can be supervised or self-supervised. The supervised approaches generally use time-synced and calibrated Lidar point clouds as a source of ground truth during training, and are more accurate than self-supervised approaches that use only geometric constraints on image sequences as a source of supervision.
A conventional way of using Pseudo Lidar methods is to generate such depth maps and run a Lidar-based 3D object detection network to extract 3D bounding boxes for each object in the scene. However, the annotations needed to train 3D bounding box detection networks are expensive and time-consuming to produce.
This got us thinking: on one hand, we have reliable 2D object detectors trained on a large number of classes that enable good scene understanding, and annotations for the corresponding 2D bounding boxes are cheap and easy to obtain. Could we leverage these detectors and combine them with Pseudo Lidar based depth map generation to get object tracks? Such a system would require Lidar point cloud data at training time to learn pseudo lidar depth predictions, but not the full 3D bounding box annotations! Coming up with an architecture that combines 2D object detectors, depth prediction outputs and tracking is the problem statement we formulated for our new (patent-pending) approach.
Nemo’s weakly supervised framework for object track generation, with superior accuracy and low training costs
In our new method, we propose to equip 2D object detection techniques with the ability to reason about the 3D scene geometry and objects’ motion dynamics. We then jointly learn to detect, track and model objects in 3D, using only 2D bounding box annotations and tracks, in a weakly supervised manner. The method leverages the scene and object geometry available via lidar (weak, sparse supervision) to train perception networks to model objects and interactions in 3D, and it uses only camera videos at test time.
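As an illustration of what weak, sparse lidar supervision can look like in practice, the sketch below penalizes the dense depth prediction only at pixels where a projected lidar return exists. The rasterized lidar_depth tensor and the plain L1 penalty are assumptions made for illustration, not a description of our actual training losses.

```python
import torch

def sparse_lidar_depth_loss(pred_depth, lidar_depth):
    """Weakly supervise a dense depth prediction with sparse lidar returns.

    lidar_depth is an image-sized tensor holding the depth of projected lidar
    points and zeros elsewhere (one common way to rasterize sparse supervision).
    """
    valid = lidar_depth > 0            # only penalize pixels with a lidar return
    if valid.sum() == 0:
        return pred_depth.new_zeros(())
    return torch.abs(pred_depth[valid] - lidar_depth[valid]).mean()
```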
We decided to extend a 2D center-point-based object detection technique to learn the whole scene’s depth. By doing so, we force the network to learn a global, scene-level scale and to reason about the scene’s 3D structure more accurately. The learned depth is then combined with the 2D object tracks to output 3D object tracks! The learned depth is also used with previous-frame tracklet information to generate 3D tracklets, which enables better tracking of overlapping and occluded objects by our neural network.
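To make the lifting and tracking step concrete, here is a simplified sketch (not our actual network code) of how a detected 2D object center and the predicted depth at that pixel can be combined into a 3D position, with a toy nearest-neighbour matcher standing in for the learned association between frames.

```python
import numpy as np

def lift_center_to_3d(center_uv, depth_map, K):
    """Back-project a detected 2D object center to a 3D point using the
    predicted depth at that pixel (simplified illustration)."""
    u, v = center_uv
    z = depth_map[int(v), int(u)]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

def associate_tracklets(prev_tracklets, detections_3d, max_dist_m=2.0):
    """Greedy nearest-neighbour association of current 3D detections to
    previous-frame tracklets, a toy stand-in for the learned association."""
    matches, used = {}, set()
    for track_id, prev_pos in prev_tracklets.items():
        dists = [np.linalg.norm(d - prev_pos) if i not in used else np.inf
                 for i, d in enumerate(detections_3d)]
        if dists and np.min(dists) < max_dist_m:
            best = int(np.argmin(dists))
            matches[track_id] = best
            used.add(best)
    return matches
```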
The architecture of our system is briefly explained in Figure 5, along with sample intermediate and final outputs in Figures 6 and 7. A backbone network encodes the previous frame, track information, and current image information into an intermediate representation that is used by two heads to predict the object track information and scene depth, respectively. The resulting network is quite lightweight and can run at ~31 FPS on a typical Nvidia desktop graphics card, prior to any optimization.
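As a rough picture of Figure 5, the following PyTorch sketch shows a shared backbone that consumes the current frame together with previous-frame context and feeds two heads: one for object-center/track predictions and one for dense depth. All layer sizes, channel counts and names are illustrative assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

class OffboardPerceptionNet(nn.Module):
    """Simplified sketch: shared backbone with a detection/track head and a
    dense depth head (illustrative sizes, not the real network)."""

    def __init__(self, in_channels=7, feat=64, num_classes=3):
        super().__init__()
        # in_channels: current RGB (3) + previous RGB (3) + prior track heatmap (1)
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Detection/tracking head: per-class center heatmaps + 2D offset to the
        # previous-frame center.
        self.det_head = nn.Conv2d(feat, num_classes + 2, 1)
        # Depth head: one dense depth value per feature-map pixel.
        self.depth_head = nn.Conv2d(feat, 1, 1)

    def forward(self, curr_img, prev_img, prev_heatmap):
        x = torch.cat([curr_img, prev_img, prev_heatmap], dim=1)
        feats = self.backbone(x)
        return self.det_head(feats), self.depth_head(feats)

# Example forward pass with random tensors (batch of 1, 384x1280 input):
net = OffboardPerceptionNet()
det, depth = net(torch.randn(1, 3, 384, 1280),
                 torch.randn(1, 3, 384, 1280),
                 torch.randn(1, 1, 384, 1280))
```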
Results — accurate object tracks extracted from camera-only data
The resulting object tracks generated by our method on the KITTI tracking dataset (validation split) are shown below in Figures 8 and 9. We see that the resulting object tracks are accurate and suitable for extracting the full scenario description of the event.
Check back for more updates on our new class of offboard perception networks powering Nemo’s data pipelines
In this blog, we detailed the first of our class of offboard perception networks: weakly supervised, camera-only object tracking that sifts through petabytes of camera data, extracts accurate object tracks and feeds them into our scenario extraction pipeline. In the upcoming blog, we’ll elaborate on the challenge of adapting this network to different cameras and locations, commonly known as Domain Adaptation.
Contact us if you are interested in converting the raw data accumulated on your servers into searchable scenarios and events, thereby transforming your big data problem into a smart data framework!