Author: Aryan Pal
Peer Reviewer: Aaryan Sukhadia
Professional Reviewer: David Xu
Abstract
Road injuries are the leading non-illness cause of death globally, claiming 1.35 million lives every year. Intersections are a major contributor to this number, yet systems created to improve the safety of intersections lack cost-effectiveness, compatibility, and privacy. The system proposed in this paper utilizes machine learning and computer vision to create an inexpensive, universally compatible, and privacy-preserving system that requires limited hardware, namely a camera and a processor. The proposed system was found to be highly capable, achieving an accuracy of over 80% in object tracking and 80% in velocity analysis, two essential components of the system, at frame rates as low as 6 frames per second. The prototype system solves the issues of cost-effectiveness, compatibility, and privacy associated with previous systems, making it an ideal system for collision prevention at intersections.
Introduction
Road injuries claim 1.35 million lives every year, making them the most common non-illness-related cause of death[1]. Road injuries also account for more deaths around the world than HIV/AIDS and tuberculosis, and are the number one cause of death among people ages 5-29[1]. The World Health Organization's Global Health Observatory data show a steady increase in road-injury-related fatalities: such fatalities were the tenth leading cause of death in the world in 2000, climbed to ninth in 2015, and reached eighth in 2016[7]. Intersections are a major contributor to these casualties, accounting for more than 40% of collisions in the United States[6]. Proposed systems for collision prevention at intersections lack cost-effectiveness and often require non-standard vehicular technology. These factors limit the systems to well-funded governments and vehicles with specialized technologies.
Low-Income Countries
The effect of traffic collisions in low-income countries is especially alarming. Low-income countries face death rates more than three times those of high-income countries, despite being home to only 1% of the world's vehicles[1]. While road-injury-related deaths are a global problem, it is clear that the problem peaks in low-income countries, where funding for improved safety measures is limited. A well-supported statistical trend shows that across the world, the rate of road-injury-related deaths per 100,000 people increases as income decreases[1]. Minimizing road-injury-related deaths in these countries requires a cost-effective solution that can be widely implemented.
Current Systems for Vehicle Detection
There are existing systems in place for the detection of vehicles at intersections. The most common is the induction loop [16]. This system is installed underneath the roadway and consists of an electric current moving through a series of wires. The current is disrupted when a large metal mass, such as a vehicle, passes or stops over the loop, and these disruptions act as the signal that a vehicle is present. The installation of these systems, however, is labor-intensive and expensive: a traditional complete induction loop system costs more than $2,000 for production and installation [17]. Additionally, these systems have limitations that reduce their effectiveness. The current is only influenced by vehicles above a certain metal mass, which means the loop can fail to detect smaller vehicles such as motorcycles. Because these systems only detect vehicles directly above them, they also have a limited range of detection. Although installing loops over a larger area seems like a simple solution, the high cost prevents this option from being economically viable. There is a need for a system that can observe a large area of the intersection to enable vehicle detection on a larger scale.
Proposed Systems for Intersections
Recently, signalized intersections have been widely targeted to reduce the number of deaths caused by traffic collisions. Due to the infrastructure already in place, signalized intersections are ideal for implementing additional technology. Advances in sensors and wireless communication systems have led to numerous proposed systems for increased safety at traffic intersections using vehicular communication [2],[3],[5],[14]. These systems are referred to as Cooperative Intersection Collision Avoidance Systems (CICAS). However, these systems often require expensive components or non-standard vehicular technology. For example, the CICAS proposed in [2] depends on the installation of expensive onboard vehicular equipment such as a computing system, a Dedicated Short-Range Communications (DSRC) radio, and a driver-vehicle interface designed only for use with the system. These factors prevent the system from being implemented on a large scale. More importantly, they also create accessibility limitations for low-income countries, where the problem is at its worst. In addition, high vehicular density can negatively impact CICAS due to their dependency on vehicular communication [15], suggesting that their effectiveness would be reduced in urban settings and areas with high traffic volume in general. CICAS also create privacy concerns for the drivers involved, as network attackers can track vehicles and gain access to a driver's personal information such as their home or work address [5]. Vehicular communication systems require heavy security measures and multiple safety features to defend against numerous network attacks [17]. This vulnerability shifts the focus from the system's effectiveness to its security.
Machine learning and computer vision have also been applied in many proposed traffic management systems for intersections [8]-[12]. However, these systems focus on the efficiency and analysis of traffic flow rather than safety. Generally, they operate by measuring traffic density at the sides of an intersection, usually by counting the vehicles present with an object detector, and adjusting the light timings accordingly. The wide use of machine learning and computer vision in intersection-based systems supports their validity and usefulness in this setting.
Objective
The limited detection area of current vehicle detection systems reduces their effectiveness on larger roadways such as highways, and these systems lack the cost-effectiveness required for large-scale implementation. Systems designed for increased safety tend to lack compatibility and cost-effectiveness, requiring many non-standard vehicular and infrastructural technologies. No cost-effective solution currently exists for enhancing intersection safety using only standard vehicular technology and intersection infrastructure.
The objective of this project was to create an inexpensive, camera-based object detection and tracking system that enhances the safety of signalized intersections while remaining universally compatible. The system would consist of two main components: a camera and a processor. The limited amount of necessary hardware means the system would be inexpensive, allowing it to be widely implemented even in low-income cities and countries. The camera enables the system to survey and analyze a large area, which is important for large roadways; it captures images of the intersection for the processor to analyze. The processor decides whether to turn the traffic light red or green based on the information gathered from the images. Specifically, the system would check for a vehicle increasing speed during a red light, as this would suggest a driver is attempting to run the light.
Methodology
System Design
At an intersection, drivers typically slow down as they approach a red light. However, drivers looking to run a red light tend to do the opposite and increase their vehicle’s speed. Therefore, it was determined that the best approach to preventing collisions was to check for vehicles increasing their speed when approaching a red light. In order to make the system compatible with any intersection and vehicle in the world, it was decided to design the system around the traffic light.
Depicted in Fig. 1 is a standard traffic intersection with four traffic lights and two directions of travel. Two traffic lights control traffic on the horizontal axis and two lights control the vertical axis. Generally, the two lights controlling one axis are green while the two lights controlling the other axis are red. Once the two green lights turn red, the two red lights turn green. However, there is a pause between the green lights turning red and the red lights turning green. During this pause, all lights at the intersection are red, and it is during this time that the system would run, as shown in Fig. 1. The two red lights that should turn green would do so only once the system detected no increase in the speed of any vehicle for a given amount of time. For example, in Fig. 1, once no increase in vehicle speed was detected, the traffic lights controlling the horizontal axis would turn green. In the scenario where a light is held from turning green, the hold time cuts into that light's total green time. This is important for compatibility with existing systems that coordinate traffic over multiple intersections.
Materials and System Hardware
To combat the problem of a limited detection area, it was determined that a camera would be best suited for capturing images of the intersection. The system design consisted of a single fisheye camera mounted to the bottom of one traffic light at an intersection, which would provide a view of all sides of the intersection. To analyze the captured images, a processor capable of running the associated detection and tracking algorithms in real time was needed. For efficient development and testing, a webcam was used as the camera and Google Colab, a cloud-based programming environment that provides access to free GPUs, was used as the processor.
System Software
Once an image was captured by the camera, the first processing task was object detection. The importance of both accuracy and computational efficiency in the proposed system led to a decision to use the YOLO (You Only Look Once) model for this task. Version 3 of the YOLO model, used in the proposed system, was shown to be state of the art in both inference time and accuracy[18]. A TensorFlow implementation of a pre-trained YOLOv3 model was used[19]. When the model was run on an image, it generated bounding box information for every detected object: the x- and y-coordinates of the top-left corner of the box, the x- and y-coordinates of the bottom-right corner of the box, and a class number. The class numbers were mapped to class names such as car, truck, and motorcycle, which allowed the system to distinguish between vehicle and non-vehicle detections.
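To make the detector's output concrete, below is a minimal sketch of the bounding box information described above. The `Box` container and the class-number mapping are illustrative assumptions (a COCO-pre-trained YOLOv3 is assumed, as is typical); the exact structures returned by the implementation in [19] may differ.

```python
from dataclasses import dataclass

# Assumed COCO class numbers for the classes the system cares about;
# the paper does not name the training set, so this mapping is illustrative.
COCO_NAMES = {0: "person", 2: "car", 3: "motorcycle", 5: "bus", 7: "truck"}

@dataclass
class Box:
    x1: float  # top-left x-coordinate
    y1: float  # top-left y-coordinate
    x2: float  # bottom-right x-coordinate
    y2: float  # bottom-right y-coordinate
    cls: int   # class number produced by the detector

    @property
    def name(self) -> str:
        """Class name, or 'other' for classes irrelevant to the system."""
        return COCO_NAMES.get(self.cls, "other")

    @property
    def midpoint(self) -> tuple:
        """Center of the box, used by the tracking algorithm."""
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)
```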
The object tracking approach was to use the midpoint of a bounding box to track it across consecutive frames. The function containing the tracking algorithm had two parameters: a box and a frame. The box parameter contained the bounding box information of the object that needed to be tracked. The frame parameter consisted of all the bounding boxes from the image in which the object needed to be tracked. The function's purpose was to return the bounding box from the frame parameter that matched the box parameter.
The algorithm began by calculating the midpoint of the box parameter. Then, the midpoint of each bounding box in the frame parameter was calculated. The match from the frame parameter was determined to be the bounding box with the most similar midpoint coordinates to the box parameter. A class filter was applied at this step to ensure that the bounding box being checked from the frame parameter was of the same class as the box parameter. This technique improved accuracy, as it guaranteed no matches would be made between objects of different classes. For example, a car would not be matched to a truck or a bus, as these vehicles are of a different class. The filter also improved the efficiency of the algorithm, as objects of a different class than the box parameter were not checked at all. Once the function had checked all qualifying bounding boxes in the frame parameter, the match was returned.
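The tracking function described above can be sketched as follows, reusing the hypothetical `Box` container from the earlier sketch. Euclidean distance between midpoints is assumed as the measure of similarity, since the paper does not name the metric.

```python
import math

def track(box, frame):
    """Return the bounding box in `frame` that matches `box`, or None.

    `frame` is the list of all Box objects detected in the frame in
    which `box` must be found. The class filter skips candidates of a
    different class; among the rest, the box whose midpoint is nearest
    to the midpoint of `box` is the match.
    """
    bx, by = box.midpoint
    best, best_dist = None, math.inf
    for candidate in frame:
        if candidate.cls != box.cls:      # class filter
            continue
        cx, cy = candidate.midpoint
        dist = math.hypot(cx - bx, cy - by)  # Euclidean distance (assumed)
        if dist < best_dist:
            best, best_dist = candidate, dist
    return best
```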
The detection model and tracking algorithm were incorporated into the velocity analysis function. Analyzing an object's change in speed required at least three consecutive frames, so the velocity analysis function was designed to take three parameters (frame_t, frame_(t-1), and frame_(t-2)), which represent the bounding boxes from the current frame, the previous frame, and the frame before the previous one respectively, as shown in Fig. 2. The function also took a fourth parameter indicating whether the traffic light that the camera was attached to was the one turning red. This was necessary to determine the direction in which the system would check for vehicles increasing in speed. For example, if the value of this parameter was true, the traffic light that the camera was attached to was turning red, and a driver intending to run a red light would do so while moving in the vertical direction of the camera's view. This situation is represented in Fig. 1. Similarly, when the parameter value was false, increases in vehicle speed in the horizontal direction were checked for.

The algorithm looped through every bounding box in frame_(t-2) and tracked it in both frame_(t-1) and frame_t using the object tracking function. A class filter was also used here to skip any objects that were neither a vehicle nor a pedestrian; the model can detect objects such as birds or trees, which are irrelevant to the system and therefore were not checked for matches. The bounding box matches determined from frame_(t-1) and frame_t by the object tracking function were saved in two variables, match2 and match3 respectively. At that point, the object's location in three consecutive frames had been determined. The change in the speed of the object was then calculated by evaluating the change in its midpoint across the three frames. If the vertical direction was being checked, the change in the y-coordinate of the midpoint was evaluated; if the horizontal direction was being checked, the change in the x-coordinate was evaluated.
A variable was created to hold the change in the midpoint position between frame_(t-2) and frame_(t-1), and a second variable stored the change in the midpoint from frame_(t-1) to frame_t. If the change from frame_(t-1) to frame_t was greater than the change from frame_(t-2) to frame_(t-1), the object was speeding up and the function returned true; otherwise, the function returned false. The velocity analysis thus returned the boolean value of the statement below, where m_t denotes the tracked object's midpoint coordinate (x or y, depending on the direction being checked) in frame_t.
|m_t − m_(t-1)| > |m_(t-1) − m_(t-2)|
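A minimal sketch of the velocity analysis function follows, assuming the `Box` container and `track` function from the earlier sketches. How objects that cannot be tracked across all three frames are handled is an assumption, as the paper does not specify it.

```python
def velocity_analysis(frame_t2, frame_t1, frame_t, light_turning_red):
    """Return True if any tracked object satisfies
    |m_t - m_(t-1)| > |m_(t-1) - m_(t-2)| on the axis being checked.

    frame_t2, frame_t1, frame_t hold the Box lists for frame_(t-2),
    frame_(t-1), and frame_t respectively.
    """
    axis = 1 if light_turning_red else 0  # y-axis (vertical) vs x-axis
    for box in frame_t2:
        if box.name == "other":           # skip non-vehicle, non-pedestrian classes
            continue
        match2 = track(box, frame_t1)
        match3 = track(match2, frame_t) if match2 is not None else None
        if match3 is None:
            continue  # assumption: untrackable objects are skipped
        m1 = box.midpoint[axis]
        m2 = match2.midpoint[axis]
        m3 = match3.midpoint[axis]
        if abs(m3 - m2) > abs(m2 - m1):
            return True                   # this object is speeding up
    return False
```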
The final step in creating a prototype system was to enable the system to control the status of traffic lights at the intersection based on the results from the velocity analysis function. Since the system was designed around the traffic light, the only information available through the light itself would be its color. Given that the system runs only when all lights are red, the direction that needed to be checked for movement could be determined by checking whether the light that the camera was attached to was turning from yellow to red. If it was, the system would check for vehicle speed increasing in the vertical direction of its view; if it was already red, the system would check the horizontal direction. Therefore, the control function was designed to take only one boolean parameter indicating whether the light that the camera was attached to was the one turning red.

The function could be programmed to check that no vehicle was increasing speed for either a number of consecutive frames or an amount of time. Whereas time is a universal measure, the time spanned by a number of consecutive frames varies with the camera's frame rate. In the prototype system, the function was made to check that no vehicles were increasing speed for ten consecutive frames.

The control function began with three images being captured by the camera. The object detection model was run on each image, and the resulting bounding boxes were stored in three variables (boxes_(t-2), boxes_(t-1), and boxes_t, where boxes_t represents all bounding boxes from the current frame, boxes_(t-1) those from the previous frame, and boxes_(t-2) those from the frame before the previous one). These three variables were then passed into the velocity analysis function, along with the boolean parameter passed into the control function. A counter variable kept track of how many consecutive frames had been analyzed without the velocity analysis function returning true. If the velocity analysis function returned true, meaning a vehicle was increasing in speed, the counter was reset to zero; if it returned false, the counter was incremented by one, indicating that no vehicle was increasing in speed across the three frames analyzed. Once the velocity analysis had been run on the first three images, the next frame was captured and the window of three variables was shifted: boxes_(t-2) took the bounding boxes from boxes_(t-1), boxes_(t-1) took those from boxes_t, and boxes_t took those from the new frame. This process is shown in Fig. 3. Once the threshold for the number of consecutive frames was met, the control function turned the light green.
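The control loop described above can be sketched as follows. The `capture`, `detect`, and `turn_green` interfaces are hypothetical stand-ins for the camera, the detection model, and the traffic light control; the paper does not specify these interfaces.

```python
CLEAR_FRAMES_REQUIRED = 10  # consecutive clear frames before turning green

def control(capture, detect, light_turning_red, turn_green):
    """Hold the light red until no vehicle speeds up for ten
    consecutive frames, then turn it green.

    `capture()` returns an image, `detect(image)` returns a list of
    Box objects, and `turn_green()` switches the light; all three are
    assumed interfaces.
    """
    boxes_t2 = detect(capture())
    boxes_t1 = detect(capture())
    boxes_t = detect(capture())
    counter = 0
    while counter < CLEAR_FRAMES_REQUIRED:
        if velocity_analysis(boxes_t2, boxes_t1, boxes_t, light_turning_red):
            counter = 0        # a vehicle sped up: restart the count
        else:
            counter += 1       # three clear frames analyzed
        # Slide the three-frame window forward by one frame (Fig. 3).
        boxes_t2, boxes_t1 = boxes_t1, boxes_t
        boxes_t = detect(capture())
    turn_green()
```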
Verification
A limitation in verifying the proposed system was the availability of video or images that could be used for testing. Because the proposed system relies on a camera attached to a traffic light, a configuration that does not currently exist in any deployed system, no video or images from a similar angle were available to verify the system as a whole. This limitation was addressed by evaluating the ability of the system's individual components at their respective tasks. Since the YOLO object detection model used was well established and pre-trained, only the object tracking algorithm and the velocity analysis algorithm were tested. If these components functioned correctly, the system as a whole could be expected to do the same.
The object tracking algorithm was tested on the Multiple Object Tracking Challenge 2017 Test 13 dataset [20], which has over thirty-five objects in each frame. This dataset was chosen because it mirrors the high-density conditions of city and highway settings. The ground truth from the dataset contained all objects across the 750 frames of the video, along with every object's bounding box in each frame in which it was present. For evaluation, the algorithm was tested on its ability to track each object across three frames, as this is the same task it faces within the proposed system. The ground truth file was used for testing, and the algorithm was evaluated at different frame rates to determine an adequate rate. The original video was recorded at 25 frames per second, so to evaluate the algorithm at a desired frame rate, a sampling interval was first determined by dividing 25 by the desired frame rate. For example, to evaluate the algorithm at 5 frames per second, the interval was 5 frames (25/5 = 5). Decimal intervals were rounded to the nearest whole number. It should be noted that the ground truth file contained no indication of each object's class; the algorithm was therefore tested without a class filter, which could be expected to reduce its accuracy in the test. Each tracking computation was timed, and across the nine different frame rates tested, an average computation time of 43.7 milliseconds was recorded.

The velocity analysis function was tested on 10 samples of vehicles increasing in speed as they approached a red light and 10 samples of vehicles decreasing in speed as they approached a red light. Each sample consisted of three consecutive images, as this is the same input the algorithm operates on within the system.
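A minimal sketch of the frame subsampling used in the tracking test above; the `frame_indices` helper is an illustrative name, not from the paper.

```python
SOURCE_FPS = 25  # frame rate of the original test video

def frame_indices(target_fps, total_frames=750):
    """Indices of the ground-truth frames used to simulate a lower
    frame rate: every `interval`-th frame of the 25 fps video."""
    interval = round(SOURCE_FPS / target_fps)  # decimal intervals rounded
    return list(range(0, total_frames, interval))

# e.g. frame_indices(5) samples every 5th frame (25/5 = 5).
```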
Results
The object tracking algorithm's accuracy increased significantly with the frame rate across the rates tested. Under high-density conditions and without a class filter, the algorithm performed well, reaching an accuracy of over 73% at frame rates as low as four frames per second, as seen in Fig. 4.
The velocity analysis function achieved high accuracy in detecting both increasing and decreasing vehicle speed. Out of the ten samples of increasing vehicle speed, the algorithm correctly identified a vehicle increasing in speed in eight samples (80%). Similarly, out of the ten samples of decreasing vehicle speed, the algorithm correctly identified that the vehicle was decreasing in speed in eight samples (80%).
Although an accuracy of 100% is desirable for a system of this purpose, the accuracy determined through testing is high for a novel system with substantial room to improve. These results indicate that the control function, which constitutes the complete system, would also function with high overall accuracy, since the decision made by the control function is the end result of the individual components working together.
Conclusion
Through the use of machine learning and computer vision, an effective system was prototyped for collision prevention at intersections. The system consists of two basic hardware components. Using a single-board computer as the processor, such as a Raspberry Pi Zero, and a compatible fisheye lens camera, such as a Raspberry Pi Fisheye Lens Camera, would cost $45. This cost makes the system extremely accessible, even for low-income countries. The prototyped system supports the idea that a machine learning and computer vision approach can be applied to create a simple yet accurate collision prevention system. The system design makes it compatible with any traffic light system and vehicle in the world, which, coupled with its cost-effectiveness, leaves no barrier to wide-scale implementation.
The lack of testing data was the main limitation in the creation of the system. Additionally, the system design does not account for cases where a vehicle decreases its speed but is still unable to stop before the light. These limitations will be addressed through additional testing of the system's individual components and design additions that account for more cases. The results of the initial prototype suggest that, alongside further testing and design expansion, the next stage of development should follow, namely an advanced prototype that functions independently.
This prototype would include a camera with a fisheye lens so that all lanes of the intersection are included in captured images, as well as an independent processor, such as a single-board computer, to run the software. The object tracking algorithm can be improved by factoring in features such as object shape and color. The traffic flow focus of other machine learning and computer vision-based systems can also be added to the proposed system, given its ability to detect and count vehicles. Additionally, given its object tracking ability, the system can be expanded to detect pedestrians crossing the road and accordingly keep the traffic light from turning green, effectively preventing a potential collision.
The potential for the system to improve road safety and save lives is evident; however, its benefits extend beyond that. A prototype with the proposed hardware is more than $1,900 cheaper than the average induction loop system used today. The low cost of the proposed system will allow widespread implementation by the governments of even low-income and developing countries, which are most in need of such systems.