Lane and Object Detection for Self Driving Cars with Road Signs Recognition


Blog By Aniket Khosa, Ark Gupta, Raj Sancheti

This blog describes a deep learning solution for lane, road sign, and object detection in self-driving cars, which my team and I prepared as our final-year project. You can find all the code related to this project in the GitHub repository mentioned.

What is Object Detection?

Object Detection is the process of finding real-world object instances like cars, bikes, TVs, flowers, and humans in still images or videos. It allows for the recognition, localization, and detection of multiple objects within an image, which provides us with a much better understanding of an image as a whole. It is commonly used in applications such as image retrieval, security, surveillance, and advanced driver assistance systems (ADAS).
Object detection can be done in multiple ways:

  • Feature-based object detection
  • Viola-Jones object detection
  • SVM classifications with HOG features
  • Deep learning object detection

Applications of Object Detection

Facial Recognition:

A deep learning facial recognition system called “DeepFace” has been developed by researchers at Facebook; it identifies human faces in digital images very effectively. Google uses its own facial recognition system in Google Photos, which automatically segregates photos based on the people in them. Facial recognition involves various components such as the eyes, nose, mouth, and eyebrows.

People Counting:

Object detection can also be used for people counting, for example to analyse store performance or crowd statistics during festivals. These scenarios tend to be more difficult because people move out of the frame quickly.
It is an important application, as during crowd gatherings this capability can be used for multiple purposes.

Industrial Quality Check:

Object detection is also used in industrial processes to identify products. Finding a specific object through visual inspection is a basic task that is involved in multiple industrial processes like sorting, inventory management, machining, quality management, packaging, etc.
Inventory management can be very tricky, as items are hard to track in real time. Automatic object counting and localization help improve inventory accuracy.

Self-Driving Cars:

Self-driving cars are the future; there is no doubt about that. But the working behind them is very tricky, as they combine a variety of techniques to perceive their surroundings, including radar, laser light, GPS, odometry, and computer vision.

A General Framework for Object Detection

Typically, we follow three steps when building an object detection framework:
First, a deep learning model or algorithm is used to generate a large set of bounding boxes spanning the full image (that is, an object localization component)

Next, visual features are extracted for each of the bounding boxes. They are evaluated and it is determined whether and which objects are present in the boxes based on visual features (i.e. an object classification component)

In the final post-processing step, overlapping boxes are combined into a single bounding box (that is, non-maximum suppression)
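As an illustration of the last step, here is a minimal NumPy sketch of greedy non-maximum suppression; the [x1, y1, x2, y2] box format, the 0.5 IoU threshold, and the function names are assumptions for illustration rather than code from our project:

```python
import numpy as np

def iou(box, boxes):
    # Intersection over union between one box and an array of boxes, all [x1, y1, x2, y2]
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop boxes that overlap it too much, repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps < iou_threshold]
    return keep
```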

What is MobileNet SSD?

SSD
The SSD architecture is a single convolutional network that learns to predict bounding box locations and classify those locations in one pass. Hence, SSD can be trained end-to-end. The SSD network consists of a base architecture (MobileNet in this case) followed by several convolution layers:

SSD operates on feature maps to detect the location of bounding boxes. Remember that a feature map is of size Df × Df × M. For each feature map location, k bounding boxes are predicted. Each bounding box carries with it the following information:

4 bounding box offsets (cx, cy, w, h) describing the box's centre and size

C class probabilities (c1, c2, …, cC)
SSD does not predict the shape of the box, just where the box is. The k bounding boxes each have a predetermined shape, set prior to actual training. For example, in the figure above there are 4 boxes, meaning k = 4.

Loss in MobileNet-SSD
With the final set of matched boxes, we can compute the loss like this:
L = (1/N) × (L_class + L_box)
Here, N is the total number of matched boxes.

L_class is the softmax loss for classification and L_box is the smooth L1 loss representing the localization error of the matched boxes.

Smooth L1 loss is a modification of L1 loss that is more robust to outliers. In the event that N is 0, the loss is set to 0 as well.
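As a quick illustration of the loss above, here is a minimal NumPy sketch of the smooth L1 term and the overall combination; the 1.0 threshold inside the smooth L1 and the function names are illustrative, not the exact training code we used:

```python
import numpy as np

def smooth_l1(x):
    # Quadratic near zero, linear for large errors, which makes it more
    # robust to outliers than a plain L1 or L2 loss.
    abs_x = np.abs(x)
    return np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5)

def ssd_loss(class_loss, box_loss, num_matched):
    # L = (1/N) * (L_class + L_box); defined as 0 when no boxes are matched.
    if num_matched == 0:
        return 0.0
    return (class_loss + box_loss) / num_matched
```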

MobileNet
The MobileNet model is based on depthwise separable convolutions, which are a form of factorized convolutions. These factorize a standard convolution into a depthwise convolution and a 1 × 1 convolution called a pointwise convolution.
For MobileNet, the depthwise convolution applies a single filter to each input channel. The pointwise convolution then applies a 1 × 1 convolution to combine the outputs of the depthwise convolution.
A standard convolution both filters and combines inputs into a new set of outputs in one step.
The depthwise separable convolution splits this into two layers: a separate layer for filtering and a separate layer for combining. This factorization has the effect of drastically reducing computation and model size.
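To make the savings concrete, here is a small Keras sketch comparing a standard 3×3 convolution with its depthwise separable counterpart; the 32-channel input and 64 output filters are just example numbers:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 32))

# Standard 3x3 convolution: filters and combines channels in a single step.
standard = tf.keras.layers.Conv2D(64, 3, padding='same')(inputs)

# Depthwise separable version: one 3x3 filter per input channel,
# followed by a 1x1 pointwise convolution that combines the channels.
x = tf.keras.layers.DepthwiseConv2D(3, padding='same')(inputs)
separable = tf.keras.layers.Conv2D(64, 1, padding='same')(x)

# Weight counts (ignoring biases):
#   standard:  3*3*32*64          = 18,432
#   separable: 3*3*32 + 1*1*32*64 =  2,336  (~8x fewer parameters)
```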

Steps for Loading the Object Detection Model
Step 1: Install the Model
Installing and getting basic tools

Installing pycocotools

Get `tensorflow/models` or `cd` to parent directory of the repository.

Compile protobufs and install the object_detection package
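The setup commands we ran looked roughly like the following; package versions and paths may differ on your machine, and this mirrors the official TensorFlow Object Detection API installation rather than being an exact transcript:

```bash
# Basic tools and the COCO API (pycocotools)
pip install --upgrade pip
pip install pillow lxml matplotlib cython contextlib2
pip install pycocotools

# Get tensorflow/models (or cd to the parent directory of an existing clone)
git clone --depth 1 https://github.com/tensorflow/models

# Compile the protobufs and install the object_detection package
# (requires the protoc Protocol Buffers compiler to be installed)
cd models/research
protoc object_detection/protos/*.proto --python_out=.
pip install .
```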

Importing files

Import the object detection module.

Patches:
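The imports and the small compatibility patches, modelled on the official object detection tutorial notebook (adjust to your TensorFlow version), look like this:

```python
import numpy as np
import pathlib
import tensorflow as tf
import cv2

from object_detection.utils import ops as utils_ops
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util

# Patches: make the TF1-style utilities work under TensorFlow 2
utils_ops.tf = tf.compat.v1
tf.gfile = tf.io.gfile
```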

Step 2: Model Preparation

Variables
Any model exported using the export_inference_graph.py tool can be loaded here simply by changing the path. By default, we use an "SSD with MobileNet" model here.
Loader
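A loader along the lines of the official tutorial downloads the model archive from the detection model zoo and restores its SavedModel; the base URL and directory layout below are the standard ones, but double-check them against the current repository (this assumes the imports shown earlier):

```python
def load_model(model_name):
    """Download a model from the detection model zoo and load its SavedModel."""
    base_url = 'http://download.tensorflow.org/models/object_detection/'
    model_file = model_name + '.tar.gz'
    model_dir = tf.keras.utils.get_file(
        fname=model_name, origin=base_url + model_file, untar=True)
    model_dir = pathlib.Path(model_dir) / 'saved_model'
    return tf.saved_model.load(str(model_dir))
```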

Loading label map
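The label map maps class IDs to human-readable names; for the COCO-trained SSD MobileNet model it is the mscoco_label_map.pbtxt file shipped inside tensorflow/models (the path below assumes the clone location used above):

```python
PATH_TO_LABELS = 'models/research/object_detection/data/mscoco_label_map.pbtxt'
category_index = label_map_util.create_category_index_from_labelmap(
    PATH_TO_LABELS, use_display_name=True)
```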

Step 3: Detection
Load an object detection model:
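With the loader defined above, loading the default model is a single call; the model name below is the SSD-with-MobileNet checkpoint used in the official tutorial:

```python
model_name = 'ssd_mobilenet_v1_coco_2017_11_17'
detection_model = load_model(model_name)
```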

Check the model's input signature; it expects a batch of 3-channel images of type uint8:
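A quick way to confirm this is to print the serving signature's inputs (the exact printout depends on the checkpoint):

```python
print(detection_model.signatures['serving_default'].inputs)
# -> a single uint8 tensor of shape [None, None, None, 3] (a batch of RGB images)
```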

And returns several outputs:
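Similarly, the output dtypes and shapes can be printed; the usual keys are detection_boxes, detection_classes, detection_scores, and num_detections:

```python
print(detection_model.signatures['serving_default'].output_dtypes)
print(detection_model.signatures['serving_default'].output_shapes)
```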

Add a wrapper function to call the model and clean up the outputs:
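A minimal wrapper, modelled on the tutorial notebook, that runs the model on one image and converts the batched tensors into plain NumPy arrays:

```python
def run_inference_for_single_image(model, image):
    # The model expects a batch, so add a batch axis before running it.
    input_tensor = tf.convert_to_tensor(np.asarray(image))[tf.newaxis, ...]
    model_fn = model.signatures['serving_default']
    output_dict = model_fn(input_tensor)

    # Strip the batch dimension and keep only the valid detections.
    num_detections = int(output_dict.pop('num_detections'))
    output_dict = {key: value[0, :num_detections].numpy()
                   for key, value in output_dict.items()}
    output_dict['num_detections'] = num_detections
    output_dict['detection_classes'] = output_dict['detection_classes'].astype(np.int64)
    return output_dict
```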

Run it on each test video and show the results:

Given the path of a video, the frames are read, converted, and processed internally:
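The per-video loop is a standard OpenCV read/annotate/write cycle; the paths, codec, and helper names below are placeholders for illustration:

```python
def process_video(video_path, output_path):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'),
                             fps, (width, height))
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads frames as BGR
        output = run_inference_for_single_image(detection_model, rgb)
        vis_util.visualize_boxes_and_labels_on_image_array(
            rgb, output['detection_boxes'], output['detection_classes'],
            output['detection_scores'], category_index,
            use_normalized_coordinates=True, line_thickness=4)
        writer.write(cv2.cvtColor(rgb, cv2.COLOR_RGB2BGR))
    cap.release()
    writer.release()

process_video('test_videos/drive.mp4', 'output_videos/drive_out.mp4')  # hypothetical paths
```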

Lane Detection

Identifying lanes on the road is a common task performed by all human drivers to ensure their vehicles are within lane constraints when driving, so as to make sure traffic is smooth and minimise chances of collisions with other cars in nearby lanes. Similarly, it is a critical task for an autonomous vehicle to perform. It turns out that recognising lane markings on roads is possible using well known computer vision techniques. We will cover how to use various techniques to identify and draw the inside of a lane, compute lane curvature, and even estimate the vehicle’s position relative to the center of the lane.

To detect and draw a polygon that takes the shape of the lane the car is currently in, we build a pipeline consisting of the following steps:

  • Computation of camera calibration matrix and distortion coefficients from a set of chessboard images
  • Distortion removal on images
  • Application of color and gradient thresholds to focus on lane lines
  • Production of a bird’s eye view image via perspective transform
  • Use of sliding windows to find hot lane line pixels
  • Fitting of second degree polynomials to identify left and right lines composing the lane
  • Computation of lane curvature and deviation from lane center
  • Warping and drawing of lane boundaries on image as well as lane curvature information

1. Camera Calibration & Image Distortion Removal

Image distortion occurs when a camera looks at 3D objects in the real world and transforms them into a 2D image. This transformation isn't always perfect, and distortion can result in a change in the apparent size, shape or position of an object. So we need to correct this distortion to give the camera an accurate view of the scene. This is done by computing a camera calibration matrix from several chessboard pictures taken with the camera, using the cv2.calibrateCamera() function.

To compute the camera transformation matrix and distortion coefficients, we use multiple pictures of a chessboard on a flat surface taken by the same camera. OpenCV has a convenient method called findChessboardCorners that will identify the points where black and white squares intersect and reverse engineer the distortion matrix this way. The image below shows the identified chessboard corners traced on a sample image:

chessboard corners traced on a sample image

distorted vs undistorted
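A sketch of this calibration step; the 9×6 inner-corner count and the camera_cal/ folder are assumptions based on typical calibration sets, so adapt them to your own images:

```python
import cv2
import glob
import numpy as np

# Prepare object points for a chessboard with 9x6 inner corners (assumed board size)
objp = np.zeros((6 * 9, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2)

objpoints, imgpoints = [], []
for fname in glob.glob('camera_cal/calibration*.jpg'):  # hypothetical path
    img = cv2.imread(fname)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, (9, 6), None)
    if found:
        objpoints.append(objp)
        imgpoints.append(corners)

# Camera matrix (mtx) and distortion coefficients (dist) used for undistortion later
ret, mtx, dist, rvecs, tvecs = cv2.calibrateCamera(
    objpoints, imgpoints, gray.shape[::-1], None, None)
```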

2. Apply a distortion correction to raw images.

The calibration data collected in step 1 can be applied to raw images to correct distortion. An example image is shown in Fig 3. The effect of distortion correction can be harder to see on a raw road image than on a chessboard image, but if you look closely at the right side of the image, it becomes more obvious: the white car and the trees are slightly cropped once the distortion correction is applied.

Fig 3. Before and after results of un-distorting an example image
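Applying the correction to a raw frame is then a single OpenCV call using the camera matrix and distortion coefficients computed in step 1 (the input path is an example):

```python
img = cv2.imread('test_images/test1.jpg')            # example input frame
undistorted = cv2.undistort(img, mtx, dist, None, mtx)
```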

3. Use color transforms, gradients, etc., to create a thresholded binary image.

The idea behind this step is to create an image-processing pipeline in which the lane lines can be clearly identified by the algorithm. There are a number of ways to get there by playing around with different gradients, thresholds and color spaces. After experimenting with several of these techniques on different images, I settled on the following combination: S channel thresholds in the HLS color space and V channel thresholds in the HSV color space, along with gradients, to detect the lane lines. An example of a final binary thresholded image is shown in Fig 4, where the lane lines are clearly visible.

Fig 4. Before and after results of applying gradients and thresholds to generate a binary thresholded image
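A sketch of the thresholding function; the specific threshold ranges are illustrative values of the kind we tuned by hand rather than the exact numbers from the project:

```python
import cv2
import numpy as np

def threshold_binary(img, s_thresh=(170, 255), v_thresh=(170, 255), sx_thresh=(20, 100)):
    # S channel (HLS) and V channel (HSV) pick up the lane colours;
    # the Sobel-x gradient picks up near-vertical edges.
    s = cv2.cvtColor(img, cv2.COLOR_BGR2HLS)[:, :, 2]
    v = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)[:, :, 2]
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sobelx = np.absolute(cv2.Sobel(gray, cv2.CV_64F, 1, 0))
    sobelx = np.uint8(255 * sobelx / np.max(sobelx))

    binary = np.zeros_like(s)
    binary[((s >= s_thresh[0]) & (s <= s_thresh[1]) &
            (v >= v_thresh[0]) & (v <= v_thresh[1])) |
           ((sobelx >= sx_thresh[0]) & (sobelx <= sx_thresh[1]))] = 1
    return binary
```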

4. Apply a perspective transform to generate a “bird’s-eye view” of the image.

Images have perspective, which causes lane lines in an image to appear to converge in the distance even though they are parallel to each other. It is easier to detect the curvature of lane lines when this perspective is removed. This can be achieved by transforming the image to a 2D bird's-eye view in which the lane lines are always parallel to each other. Since we are only interested in the lane lines, I selected four points on the original undistorted image and transformed the perspective to a bird's-eye view, as shown in Fig 5 below.

Fig 5. Region of interest perspective warped to generate a Bird’s-eye view
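The warp itself only needs four source points on the road and four destination points in the bird's-eye image; the coordinates below are illustrative for a 1280×720 frame:

```python
src = np.float32([[585, 460], [695, 460], [1127, 720], [203, 720]])
dst = np.float32([[320, 0], [960, 0], [960, 720], [320, 720]])

M = cv2.getPerspectiveTransform(src, dst)     # forward warp (to bird's-eye view)
Minv = cv2.getPerspectiveTransform(dst, src)  # inverse warp (back to camera view)
warped = cv2.warpPerspective(binary, M, (binary.shape[1], binary.shape[0]),
                             flags=cv2.INTER_LINEAR)
```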

5. Detect lane pixels and fit to find the lane boundary.

To detect the lane lines, there are a number of different approaches. I used convolution, which is the sum of the product of two separate signals: the window template and a vertical slice of the pixel image. A sliding window method applies this convolution across the image from left to right, and any overlapping values are summed together, creating the convolved signal. The peak of the convolved signal is where the overlap of hot pixels is highest, and it is the most likely position for the lane marker. Using this approach, the left and right line pixels are identified in the rectified binary image and fitted with a second-degree polynomial. Example images with line pixels identified via the sliding window approach and the polynomial fit overlaid are shown in Fig 6.

Fig 6. Sliding window fit results
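A condensed sketch of the convolution-based window search; the window sizes and margin are typical values rather than the exact ones we used, and the returned centroids are then fitted with np.polyfit(y, x, 2) to obtain the second-degree polynomials for the left and right lines:

```python
import numpy as np

def find_window_centroids(warped, window_width=50, window_height=80, margin=100):
    window = np.ones(window_width)          # flat convolution template
    centroids = []

    # Starting positions from column sums over the bottom quarter of the image
    l_sum = np.sum(warped[int(3 * warped.shape[0] / 4):, :warped.shape[1] // 2], axis=0)
    l_center = np.argmax(np.convolve(window, l_sum)) - window_width / 2
    r_sum = np.sum(warped[int(3 * warped.shape[0] / 4):, warped.shape[1] // 2:], axis=0)
    r_center = np.argmax(np.convolve(window, r_sum)) - window_width / 2 + warped.shape[1] // 2
    centroids.append((l_center, r_center))

    # Slide a window up the image, re-centring on the convolution peak at each level
    for level in range(1, warped.shape[0] // window_height):
        layer = np.sum(warped[warped.shape[0] - (level + 1) * window_height:
                              warped.shape[0] - level * window_height, :], axis=0)
        conv = np.convolve(window, layer)
        offset = window_width / 2
        l_lo = int(max(l_center + offset - margin, 0))
        l_hi = int(min(l_center + offset + margin, warped.shape[1]))
        l_center = np.argmax(conv[l_lo:l_hi]) + l_lo - offset
        r_lo = int(max(r_center + offset - margin, 0))
        r_hi = int(min(r_center + offset + margin, warped.shape[1]))
        r_center = np.argmax(conv[r_lo:r_hi]) + r_lo - offset
        centroids.append((l_center, r_center))
    return centroids
```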

6. Determine the curvature of the lane and vehicle position with respect to the center of the lane.

I took the measurements of where the lane lines are and estimated how much the road is curving, along with the vehicle position with respect to the center of the lane. I assumed that the camera is mounted at the center of the car.
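A sketch of the calculation; the metres-per-pixel conversions are assumed values for a standard lane width seen over roughly 30 m, and ploty, leftx, left_fitx and right_fitx are the pixel coordinates produced by the polynomial-fit step:

```python
import numpy as np

def curvature_and_offset(ploty, leftx, left_fitx, right_fitx, image_width):
    ym_per_pix = 30 / 720    # metres per pixel in y (assumed visible lane length)
    xm_per_pix = 3.7 / 700   # metres per pixel in x (assumed lane width of 3.7 m)

    # Re-fit the left line in real-world units and evaluate at the bottom of the image
    fit_cr = np.polyfit(ploty * ym_per_pix, leftx * xm_per_pix, 2)
    y_eval = np.max(ploty) * ym_per_pix
    curvature = ((1 + (2 * fit_cr[0] * y_eval + fit_cr[1]) ** 2) ** 1.5) / np.abs(2 * fit_cr[0])

    # Camera assumed mounted at the centre of the car, so the deviation from the
    # lane centre is measured from the middle of the image.
    lane_center = (left_fitx[-1] + right_fitx[-1]) / 2
    offset = (image_width / 2 - lane_center) * xm_per_pix
    return curvature, offset
```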

7. Warp the detected lane boundaries back onto the original image and display numerical estimation of lane curvature and vehicle position.

The fit from the rectified image has been warped back onto the original image and plotted to identify the lane boundaries. Fig 7 demonstrates that the lane boundaries were correctly identified and warped back on to the original image. An example image with lanes, curvature, and position from center is shown in Fig 8.

Fig 7. Lane line boundaries warped back onto original image

Fig 8. Detected lane lines overlapped on to the original image along with curvature radius and position of the car

Output result of lane detection

A General Framework for Road Signs Detection:

ResNet-50:

ResNet, short for Residual Network, is a classic neural network used as a backbone for many computer vision tasks. This model was the winner of the ImageNet challenge in 2015. The fundamental breakthrough with ResNet was that it allowed us to successfully train extremely deep neural networks with 150+ layers. Prior to ResNet, training very deep neural networks was difficult due to the problem of vanishing gradients.

AlexNet, the winner of ImageNet 2012 and the model that arguably kick-started the focus on deep learning, had only 8 layers; the VGG network had 19, Inception (GoogLeNet) had 22, and ResNet-152 has 152 layers. In this blog we use ResNet-50, a smaller version of ResNet-152 that is frequently used as a starting point for transfer learning.

ResNet first introduced the concept of skip connection. The diagram below illustrates skip connection. The figure on the left is stacking convolution layers together one after the other. On the right we still stack convolution layers as before but we now also add the original input to the output of the convolution block. This is called skip connection.

ResNet is a short name for Residual Network. As the name of the network indicates, the new terminology that this network introduces is residual learning.

What is the need for Residual Learning?

Deep convolutional neural networks have led to a series of breakthroughs in image classification, and many other visual recognition tasks have also greatly benefited from very deep models. So, over the years, there has been a trend to go deeper, to solve more complex tasks and to improve classification/recognition accuracy. But as we go deeper, training the neural network becomes difficult, and the accuracy starts saturating and eventually degrades. Residual learning tries to solve both of these problems.

What is Residual Learning?

In general, in a deep convolutional neural network, several layers are stacked and trained for the task at hand. The network learns several low-, mid- and high-level features at the end of its layers. In residual learning, instead of trying to learn some features directly, we try to learn a residual. The residual can be understood simply as the difference between the feature learned by a layer and the input to that layer. ResNet does this using shortcut connections (directly connecting the input of the nth layer to some (n+x)th layer). It has been shown that training this form of network is easier than training plain deep convolutional neural networks, and the problem of degrading accuracy is also resolved.

This is the fundamental concept of ResNet.
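A minimal Keras sketch of an identity residual block shows the idea: the block outputs F(x) + x, so the stacked layers only have to learn the residual F(x):

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters):
    # Two conv layers (the residual function F) plus a skip connection that
    # adds the original input back to the output.
    shortcut = x
    y = layers.Conv2D(filters, 3, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])   # the skip connection
    return layers.Activation('relu')(y)
```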

ResNet50 is a 50 layer Residual Network. There are other variants like ResNet101 and ResNet152 also.

For the proper structure of ResNet-50 you can visit-

Steps for Loading the Road Sign Detection Model:

  • First of all, we segregated the classes and then selected the ones we wanted to work with. We selected 23 classes, which cover all the basic signs found on roads.
  • Secondly, we selected around 23,400 images and trained a pre-trained ResNet-50 model on them, with the images labelled according to the classes we specified (see the sketch after this list).
  • ResNet-50 is a convolutional neural network that is 50 layers deep. You can load a version of the network pre-trained on more than a million images from the ImageNet database; as a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224. We fine-tuned it on our classes, and as output it produces a TensorFlow Lite file.
  • We take the tflite file produced by this module and feed it into the application. We designed the application and loaded the trained model into it; the app uses the phone's camera to detect road signs.
  • We then display which of the three candidate traffic signs matches the live detected sign, along with a detection percentage (%), i.e. how confident the detection is.
  • As soon as we bring the camera near a sign, the app detects it and displays the confidence percentage.
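The sketch below outlines the flow described above: a pre-trained ResNet-50 backbone, a new 23-class softmax head, and conversion to a .tflite file for the Android app. Dataset loading, augmentation and the exact hyperparameters are omitted, so treat it as an outline rather than our exact training script:

```python
import tensorflow as tf

NUM_CLASSES = 23  # the 23 road-sign classes described above

# ResNet-50 feature extractor pre-trained on ImageNet, without its classification head
base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                      input_shape=(224, 224, 3), pooling='avg')
base.trainable = False  # keep the pre-trained backbone frozen initially

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # dataset pipelines not shown

# Convert the trained model to TensorFlow Lite for the Android app
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
open('road_signs.tflite', 'wb').write(tflite_model)
```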

Requirements for the development of the App

  • Android Studio 3.2 (installed on a Linux, Mac or Windows machine)
  • Android device in developer mode with USB debugging enabled
  • USB cable (to connect Android device to your computer)

Below you can see the classes we have selected:

For the testing part, we used the test dataset already provided in the German Traffic Sign Recognition Benchmark. We took 4,000 images for validation, and it turned out that the major chunk of the data was detected with an average accuracy of 95%. This shows that our module not only gives good predictions but also consumes little space, so you can have a small app on your phone and carry it wherever you want.

RESULTS ROAD SIGNS:

The results shown here are images selected from the test dataset that weren't included in the training dataset. You can see that the signs were detected correctly in the majority of cases, with 100% accuracy. As this is real-time detection, results may vary at any particular instance because of the detection angle, camera stability, direction of movement, etc. But the accuracy the model gives is notable, and the approach can be extended with more classes, better camera stability, and other features. Given below you can see the detections along with their accuracy:

Fig. Detection of Keep Right Signal with 100% accuracy

Fig. Detection of No Passing with 100% accuracy

Fig. Detection of Stop with 100% accuracy

Fig. Detection of No Entry with 99.61% accuracy

You can find out more about our project at the links given below:

  1. For the project you can visit the given link to extract the whole project in a single file: https://drive.google.com/drive/folders/1pTLVPCqP7D8KxVxC2eKz-2E_mvt4Y5gB?usp=sharing
  2. For the YouTube demonstration, you can visit the given below link: https://www.youtube.com/watch?v=qKJ8kfxLMuA
