Emergency Alerts using Machine Learning

Published by Edwin Sam on


In recent times, crime rates in metropolitan areas have multiplied. A lot of time, manpower, and resources are spent on routine monitoring to detect accidents. To this day, that monitoring is done manually and consumes a lot of time. Manned security cannot be maintained across large spaces without considerable human effort, which further supports the case for the proposed solution.

In this article, I propose an accident detection system based on a CNN architecture that uses live CCTV video to enhance public safety and support law enforcement. My project used various pre-trained image classification CNN models to detect road accidents, and my system classifies the accident and notifies the relevant authorities. The models I tested include VGG-16 and ResNet. The dataset mainly consists of videos from UCF Crime, RWF 2000, and raw YouTube videos. After training, the models predicted accidents with accuracies ranging from 85% in the worst case to 93% in the best case.

The rate of road accidents in metropolitan areas is increasing day by day. A significant portion of the workforce is dedicated to maintaining civil law and order, so this area calls for more automation rather than more human resources. A single human error could set off a series of catastrophic events, which further motivates my proposal to develop an end-to-end solution. By designing a scalable, efficient model that uses minimal resources, I have eliminated the need for multiple monitors and a PC.

Given this situation, I use a deep convolutional neural network model that requires little human intervention. I validated different pre-trained deep learning models, such as VGG-16, InceptionV3, and ResNet, to find a balance between predictive quality (accuracy and precision) and the time taken to produce results. Based on the results obtained, I chose VGG-16 as my base model. The input is a livestream feed, which is split into smaller clips; frames are then extracted from each clip at a fixed frame rate. The frames are sent to the model for prediction, and the results are obtained in near real time.

My model’s prediction delay is minimal, and detection is close to real time. Consolidated alerts are sent to supervisors and relevant local authorities to further enhance the safety of members of society. Since most newly constructed areas are likely to have a CCTV system installed nearby, the initial cost and barrier to entry are already met, making the model easier to deploy. Model performance is measured using four metrics: precision, recall, accuracy, and F1 score.


For civil protection and competent law enforcement, advanced security systems are required to deal with road accidents in metropolitan areas. The vast amount of data generated by video surveillance systems requires automation to make the process less cumbersome.

Population growth in urban areas of densely populated countries like India makes it very difficult to patrol high-crime areas. This lack of coverage has increased insecurity in these areas. My real-time road accident detection system addresses these problems to a great extent. Pairing surveillance systems with machine learning models helps in two ways:

First, it reduces the need for police officers to patrol the streets manually. Second, machine-based systems are unbiased and avoid human error.


  1. Literature review or background research
  2. Understanding the architecture of CNN models
  3. Learn to cleanse and analyse data
  4. Data collection and pre-processing
  5. Learn machine learning
  6. Explore and experiment with different machine learning models
  7. Model design and implementation
  8. Model check and fix
  9. Future Scope

Literature Review

In the field of machine learning, much research over the last ten years has focused on detecting crime. Many researchers have developed innovative and technically efficient methods and algorithms. Some of the recent work in this area is described below.

Wang et al. [2] proposed a lightweight action recognition architecture based on deep neural networks using RGB data. It uses Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM) units, and a temporal-wise attention model, and has three main components: the input processing model, the single-frame representation model, and the activity recognition model.

Dogru and Subasi [3] proposed an intelligent traffic accident detection system in which vehicles exchange their microscopic vehicle variables with each other. Using a random forest classifier, they report 91.56% accuracy.

Another work by the same authors analyzes traffic behavior. A vehicle that behaves differently from the surrounding traffic is likely to cause an accident, and the results showed that a clustering algorithm could successfully detect accidents.

Research Methodology


I studied various pre-trained image classification CNN models, such as VGG-16, ResNet, Inception-V3, and MobileNet, and validated them against my dataset, which was collected from multiple sources. I compared their precision, accuracy, and F1 score to check which one suited my use case.

Data Collection:

I collected and isolated around 25,000 labeled videos from multiple sources so that they could be used to train separate models designed to detect these situations and accidents. I also did some background research to understand these situations.

Understanding Image Classification

The image classification problem looks like this: given a set of images, each labeled with one category, we are asked to predict these categories for a new set of test images and to measure the accuracy of our predictions. This task presents challenges such as viewpoint variation, scale variation, intra-class variation, image deformation, occlusion, lighting conditions, and background noise.

How can we create an algorithm that can classify images into different categories? Computer vision researchers have developed a data-driven approach to this problem. Instead of specifying directly in code what each image category of interest should look like, we provide the computer with many examples of each image class and develop a learning algorithm that looks at those examples and learns the visual appearance of each class. In other words, we first collect a training dataset of labeled images and then feed it to the computer so that it can become familiar with the data.

Considering this fact, a complete image classification pipeline can be formalized as follows:

  • The input is a training dataset of N images, each labeled with one of K different classes.
  • This training set is then used to train a classifier that learns what each class looks like.
  • Finally, we assess the classifier’s quality by asking it to predict labels for a new set of images that it has never seen before. We then compare the true labels of these images with the labels predicted by the classifier.
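
The three steps of this pipeline can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a nearest-centroid classifier stands in for the CNN, each "image" is a made-up two-number feature vector, and the names train, predict, and accuracy are hypothetical helpers rather than part of my system.

```python
from collections import defaultdict
from math import dist


def train(samples, labels):
    """Learn one centroid (mean feature vector) per class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the class whose centroid is closest to the input."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(centroids, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)


# Step 1: a labeled training set (N = 4 samples, K = 2 classes).
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = ["accident", "accident", "no_accident", "no_accident"]

# Step 2: train the classifier on the labeled examples.
model = train(train_x, train_y)

# Step 3: evaluate on samples the classifier has never seen.
test_x = [[0.7, 0.9], [0.3, 0.1]]
test_y = ["accident", "no_accident"]
test_accuracy = accuracy(model, test_x, test_y)
```

A CNN replaces the centroid step with learned convolutional features, but the train, predict, and evaluate structure stays the same.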

Convolutional Neural Networks

A convolutional neural network (CNN) is the most common neural network model used for image classification problems. The big idea behind CNNs is that a local understanding of an image is enough. The practical benefit is that, with fewer parameters, training is much faster and less data is needed to train the model. Instead of a fully connected network with a weight for every pixel, a CNN has just enough weights to look at a small patch of the image. It’s like reading a book through a magnifying glass: eventually we read the entire page, but we only look at a small portion of it at a time.

In 2012, a revolution happened: a new deep learning algorithm, a convolutional neural network called AlexNet, broke records at the annual ILSVRC computer vision competition.

Understanding the architecture of CNN

Imagine a 256×256 image. A CNN can scan it efficiently chunk by chunk, say with a 5×5 window. The 5×5 window slides along the image (usually left to right, top to bottom). How far it moves at each step is called the stride length. For example, with a stride length of 2, the 5×5 sliding window moves two pixels at a time until it has spanned the entire image.
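
To make the stride concrete, here is a small sketch that lists where a 5×5 window lands along one axis of a 256-pixel image. It assumes no padding, and the function names are mine, chosen only for this illustration.

```python
def window_positions(image_size, window_size, stride):
    """Top-left offsets of a window sliding along one image axis (no padding)."""
    return list(range(0, image_size - window_size + 1, stride))


def output_size(image_size, window_size, stride):
    """Number of window positions along one axis (no padding)."""
    return (image_size - window_size) // stride + 1


positions = window_positions(256, 5, 2)   # 0, 2, 4, ..., 250
per_axis = output_size(256, 5, 2)         # 126 positions per axis
```

With a stride of 2 the last window starts at pixel 250 and ends at pixel 254, so the final pixel is only covered if the image is padded.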

A convolution is the weighted sum of the image’s pixel values as the window slides across the image. Applying this process across a whole image with a weight matrix produces another image (of the same size, by convention). Convolving is the process of applying a convolution.
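
The weighted sum can be written out directly. The sketch below is a plain-Python illustration, not the optimized routine a deep learning library would use. It assumes a single-channel image, a stride of 1, and zero padding at the borders so that the output keeps the input size.

```python
def convolve2d_same(image, kernel):
    """Slide `kernel` over `image` and return the weighted sums.

    Pixels outside the image count as zero, so the output has the same
    size as the input. Like most deep learning libraries, this does not
    flip the kernel (strictly speaking, it is a cross-correlation).
    """
    height, width = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    pad_y, pad_x = k_h // 2, k_w // 2
    output = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    yy, xx = y + i - pad_y, x + j - pad_x
                    if 0 <= yy < height and 0 <= xx < width:
                        total += image[yy][xx] * kernel[i][j]
            output[y][x] = total
    return output


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
# A 3x3 averaging kernel: every weight is 1/9.
box_blur = [[1 / 9] * 3 for _ in range(3)]
blurred = convolve2d_same(image, box_blur)
```

In a CNN the kernel weights are not fixed like this averaging kernel; they are learned during training.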

The sliding-window operation happens in the convolutional layers of the neural network. A typical CNN has multiple convolutional layers. Each convolutional layer typically produces many alternate convolutions, so the weight matrix is a 5 × 5 × n tensor, where n is the number of convolutions.

For example, suppose an image passes through a convolutional layer with a 5 × 5 × 64 weight matrix, which produces 64 convolutions by sliding a 5 × 5 window. This layer has 5 × 5 × 64 (= 1,600) parameters, significantly fewer than a fully connected network, where each neuron alone needs 256 × 256 (= 65,536) weights.

The beauty of CNN is that the number of parameters does not depend on the size of the original image. We can run the same CNN on a 300×300 image, and the number of parameters remains the same in the convolutional layers.
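
The arithmetic can be checked with a short sketch. It assumes a single-channel image and ignores bias terms, and the helper names are illustrative only.

```python
def conv_weights(window_height, window_width, num_filters):
    """Weights in a convolutional layer on a single-channel image (no biases)."""
    return window_height * window_width * num_filters


def dense_weights_per_neuron(image_height, image_width):
    """Weights one fully connected neuron needs to see every pixel."""
    return image_height * image_width


conv_params = conv_weights(5, 5, 64)              # 1600, whatever the image size
dense_256 = dense_weights_per_neuron(256, 256)    # 65536
dense_300 = dense_weights_per_neuron(300, 300)    # 90000
```

The convolutional count takes no image dimension as input, which is why it stays at 1,600 for a 300×300 image while the fully connected count grows.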

a) The different layers of a CNN

Convolutional neural networks have four types of layers: convolutional layers, pooling layers, ReLU correction layers, and fully connected layers.

i) The convolutional layer is the key component of convolutional neural networks and is always at least their first layer.

ii) The pooling layer is often placed between two convolutional layers: it receives several feature maps and applies the pooling operation to each. The pooling operation reduces the size of the images while preserving their essential characteristics.

iii) ReLU (Rectified Linear Unit) refers to the real nonlinear function defined by ReLU(x) = max(0, x).

The ReLU correction layer replaces all negative values received as inputs with zeros. It acts as an activation function.
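
As a minimal sketch, ReLU applied to a small feature map looks like this (the 2×2 values are made up for illustration):

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def relu_layer(feature_map):
    """Apply ReLU to every value of a 2D feature map."""
    return [[relu(value) for value in row] for row in feature_map]


feature_map = [[-2.0, 0.5], [3.0, -0.1]]
activated = relu_layer(feature_map)   # negatives become 0.0, the rest pass through
```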

iv) The fully connected layer

The final fully connected layer classifies the image given as input to the network. It returns a vector of size N, where N is the number of classes in the image classification problem. Each element of the vector gives the probability that the input image belongs to the corresponding class.
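
One common way to turn the last layer's raw scores into probabilities is the softmax function. The sketch below assumes softmax is used (the text above does not name the function), and the two class names and scores are made up.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["accident", "no_accident"]    # N = 2 classes
probabilities = softmax([2.0, 0.5])          # hypothetical scores from the last layer
predicted = class_names[probabilities.index(max(probabilities))]
```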

b) Understanding Confusion Matrix

It is a performance measurement for machine learning classification problems where the output can be two or more classes. For a binary problem, it is a table containing the four different combinations of predicted and actual values.

It is useful for measuring recall, precision, specificity, and accuracy, and, most importantly, for AUC-ROC curves.

Let’s understand TP, FP, FN, and TN:

  • True Positive: We predicted positive, and it’s true.
  • True Negative: We predicted negative, and it’s true.
  • False Positive (Type 1 Error): We predicted positive, and it’s false.
  • False Negative (Type 2 Error): We predicted negative, and it’s false.
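
The four counts lead directly to the usual metrics. The sketch below uses made-up labels (1 = accident, 0 = no accident) and assumes at least one actual and one predicted positive, so that no division by zero occurs.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, and TN for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 score from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


actual    = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]
counts = confusion_counts(actual, predicted)   # (TP, FP, FN, TN)
scores = metrics(*counts)
```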


For this part, I split each video into individual frames and selected frames at a fixed rate (1 in every 7 frames in this case). The input frames were then augmented and resized to 128 x 128 x 3 (RGB). These transformed frames were added to a list, which I converted into a NumPy array, and I normalized the RGB values to reduce bias. Finally, I split the data into two parts: training and validation.
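
A rough sketch of the sampling, normalization, and split steps is shown below. It is not my exact code: it assumes the frames are already decoded and resized to 128 x 128 x 3, it skips augmentation, and the 80/20 split, the shuffling, and the seed are illustrative choices.

```python
import numpy as np


def preprocess(frames, every_n=7, validation_fraction=0.2, seed=0):
    """Sample, normalize, and split frames that are already 128 x 128 x 3."""
    sampled = list(frames)[::every_n]                      # keep 1 frame in every 7
    data = np.asarray(sampled, dtype=np.float32) / 255.0   # scale RGB values to [0, 1]
    order = np.random.default_rng(seed).permutation(len(data))
    n_val = int(len(data) * validation_fraction)
    # First n_val shuffled indices go to validation, the rest to training.
    return data[order[n_val:]], data[order[:n_val]]


# Synthetic stand-in for 70 decoded video frames with 8-bit RGB values.
video = np.random.default_rng(1).integers(0, 256, size=(70, 128, 128, 3)).astype(np.uint8)
train_frames, val_frames = preprocess(video)
```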


The road accident detection model was trained on footage that includes vehicles in close proximity that had not yet met with an accident but might collide.


For this model, I have used VGG-16 as the base model.

Dataset used to train this model:

Car Accident Detection and Prediction (CADP)-

Accident: 50 Videos (2.14GB)

Non-Accident: 25 Videos (535MB)

The ratio of accident to non-accident videos was deliberately kept skewed (2:1) so that the model would rarely miss an accident, since an accident that goes unnoticed can prove fatal. It was therefore essential to achieve a high recall. On training this model on my dataset, I got a precision of 87.9% and a recall of 99.4%.


To train the model, I needed a pre-trained base model that gave the highest accuracy on my dataset. I therefore ran a comparative analysis of several pre-trained models on a subset of my dataset to choose the best one for my use case.

The pre-trained model I chose is VGG-16, a standard image classification model available in the Keras library [4] in TensorFlow [5].


I developed this system for safety purposes in residential areas, and users can easily connect the CCTV cameras installed in their area to the system.

Live video from a CCTV camera is split into frames and passed through my trained model to detect road accidents. When an accident is detected, the result is displayed on the system, and automatic alerts are sent to the people concerned.
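
The frame-by-frame flow can be sketched as a small loop. Everything here is a stand-in: predict and notify are hypothetical callables for the trained model and the alerting channel, and the 0.5 threshold is an assumed value, not one taken from my system.

```python
def monitor(frames, predict, notify, threshold=0.5):
    """Run the model on each frame and alert when an accident is predicted.

    `predict` maps a frame to an accident probability and `notify` sends
    the alert; both are hypothetical callables supplied by the caller.
    Returns the indices of the frames that triggered an alert.
    """
    alerted = []
    for index, frame in enumerate(frames):
        probability = predict(frame)
        if probability >= threshold:
            notify(f"Possible road accident in frame {index} (p={probability:.2f})")
            alerted.append(index)
    return alerted


# Stand-ins so the sketch runs without a camera or a trained model.
fake_frames = ["frame0", "frame1", "frame2"]
fake_scores = {"frame0": 0.10, "frame1": 0.92, "frame2": 0.40}
messages = []
alerts = monitor(fake_frames, fake_scores.get, messages.append)
```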


Figure: Working of the System

Livestream Capture and Breaking Video Stream into frames:

I captured the livestream in my Python script and broke it down into frames using the OpenCV (cv2) library [6]. I then resized the frames so they could be fed into the model.
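
A possible shape for the frame-sampling step is sketched below. To keep it runnable without a camera, the function accepts any object with a read() method that returns (ok, frame), which is the interface of cv2.VideoCapture. The FakeCapture class is a test double I made up for this example, and the 1-in-7 sampling rate is an assumed default.

```python
def sample_frames(capture, every_n=7):
    """Yield every `every_n`-th frame from an OpenCV-style capture.

    `capture` needs a `read()` method returning `(ok, frame)`, as
    `cv2.VideoCapture` does.
    """
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:              # stream ended or a frame could not be read
            break
        if index % every_n == 0:
            yield frame
        index += 1


class FakeCapture:
    """Test double that mimics `cv2.VideoCapture.read()`."""

    def __init__(self, frames):
        self._frames = iter(frames)

    def read(self):
        frame = next(self._frames, None)
        return (frame is not None), frame


# Twenty numbered "frames"; frames at positions 0, 7, and 14 are kept.
selected = list(sample_frames(FakeCapture(range(1, 21))))
```

With OpenCV installed, the capture would be created with cv2.VideoCapture(stream_url), where stream_url is a placeholder for the camera address, and each yielded frame could be resized with cv2.resize(frame, (128, 128)) before prediction.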


The research paper I used as a reference for building a traffic accident detection model is Traffic Accident Detection Using Random Forest Classifier (2018) [7]. It reported precision and recall of 90.02% and 89.6%, respectively.

In this case, I used VGG-16 as my base model and achieved an accuracy of 88.5% and a recall of 99.4%, an improvement in recall over the paper I referred to.

Traffic accident detection model that predicts frames in which collisions occur:

Figure: Road accident detected


Road Accident Detection Model

F1 Score: 0.933

Table: Tabulated performance metrics
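
As a quick consistency check, the F1 score is the harmonic mean of precision and recall. Plugging in the precision (87.9%) and recall (99.4%) reported for the road accident model reproduces the tabulated value:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.879, 0.994)   # rounds to 0.933
```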


We are all aware that crime rates continue to rise. Demand for security systems will rise correspondingly, especially during peak hours. This consumes vast resources and requires endless hours of manual work to monitor CCTV footage. As technology advances, it seems appropriate to automate this process.

The solution I propose is machine learning-based real-time CCTV surveillance patrol. I have narrowed down the use case to Automated Livestream-based patrolling for residential areas. The main focus of my project is to use live CCTV footage that feeds my model.

Traffic accidents can be detected with a negligible time buffer. If spotted, my patrol system will notify the relevant authorities immediately.

Additionally, my system reduces the possibility of human error. These models are tested on standard performance metrics such as accuracy, precision, recall, and F1 score.

Future Scope

  1. A dedicated web app will be built so that all members of society can see the information about a mishap, increasing awareness.
  2. The model can be expanded to detect other activities like fighting, fire, burglary, and kidnapping.
  3. The accuracy of the model can be increased by training it on a customized dataset from the location where it is deployed.




[2] L. Wang, Y. Xu, J. Cheng, H. Xia, J. Yin, and J. Wu, “Human Action Recognition by Learning Spatio-Temporal Features with Deep Neural Networks,” IEEE Access, vol. 6, pp. 17913–17922.


[3] Dogru, N., & Subasi, A. (2018). Traffic accident detection using random forest classifier. 2018 15th Learning and Technology Conference (L&T). doi:10.1109/lt.2018.8368509

[4] Keras Library https://keras.io/

[5] TensorFlow https://www.tensorflow.org/

[6] OpenCV https://opencv.org/


[8] https://towardsdatascience.com/understand-the-architecture-of-cnn-90a25e244c7



[11] UCF Crime Dataset:


[12] YouTube Videos