
Using Computer Vision to Monitor Loitering in Train Stations


Train stations are bustling hubs, and with the continuous flow of passengers and employees, maintaining security, safety, and operational efficiency is a complex task. Key issues in this context include:

  • Security: Preventing criminal activities, ensuring passenger safety, and safeguarding infrastructure from vandalism or terrorism.

  • Operational Efficiency: Optimizing resource allocation, reducing delays, and improving the overall efficiency of station operations.

  • Customer Experience: Enhancing the passenger experience by minimizing congestion and delays while providing a sense of security and well-being.

  • Safety: Minimizing the risk of accidents, such as platform falls and overcrowding-related incidents.

  • Loitering Detection: Identifying and addressing loitering incidents that can disrupt station operations and pose security threats.

In this case study, we delve into a real-world example of a train station that has embraced computer vision for loitering detection, exploring the benefits and outcomes of using EyesOnIt. Loitering individuals can disrupt the orderly flow of passengers, potentially engage in criminal activities, or even pose a threat to themselves and others. Traditional surveillance methods, such as security cameras monitored by human personnel, often prove to be insufficient due to limitations in real-time monitoring, alertness, and scalability. Computer vision technology, powered by machine learning algorithms and capable of analyzing vast amounts of video data in real-time, offers a powerful solution to this challenge.

By implementing computer vision systems, train stations can enhance their ability to identify and respond to loitering incidents swiftly and accurately. These systems can continuously analyze video feeds from multiple cameras, detect suspicious behaviors, and trigger immediate alerts to security personnel, making it possible to address potential security threats proactively.

The Problem With Off-the-Shelf Computer Vision to Monitor Loitering

Using computer vision to monitor loitering is a well-known use case in the railroad industry. A large railroad company engaged EyesOnIt to develop a system that detects loitering in train stations during off hours. A wide variety of person detection neural networks are free and readily available online. EyesOnIt can use these neural networks to detect people, after which an alert can be sent to stakeholders. However, these existing neural networks carry several limitations.

Figure 1. Sample output from traditional person detection neural network.

Off-the-shelf computer vision models were insufficient for this particular use case for the following reasons:

  • Loitering typically involves people in non-standard conditions (lying down, obscured by objects or baggage, poorly lit, etc.), or people actively trying to avoid detection.

  • Traditional neural networks are trained on datasets of people in everyday attire, in good lighting, and without obstructions.

  • Obtaining training data to enhance traditional neural networks is often difficult due to privacy concerns or the lack of historical data retention.

Figure 2. Sample difficult detection scenarios in test environment.


To address this problem, EyesOnIt made use of text-image models. These models learn to match a text description of a scene to the scene itself and can be trained on billions of captioned images readily available online, alleviating the need to manually prepare each training sample. A large training corpus meant the training data was likely to contain samples of “edge” or “difficult” cases for any specific object of interest. EyesOnIt’s skilled data scientists crafted text descriptions to correctly identify the scenarios the customer was interested in, a process known as prompt engineering.

Figure 3. An illustration of how text-image models are trained using images and their corresponding text captions.

Figure 4. An illustration of how an image-text model is used to select the correct text description of an input image. Note that this assumes the model has already been trained as shown in Figure 3.
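The matching process illustrated in Figures 3 and 4 amounts to comparing an image embedding against the embeddings of several candidate text prompts and selecting the closest. The sketch below is a minimal illustration of that idea; it uses fixed stand-in vectors rather than a real text-image encoder, and the prompt wordings are hypothetical examples, not EyesOnIt's production prompts:

```python
import numpy as np

def cosine_similarity(a, b):
    """Similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_matching_prompt(image_embedding, prompt_embeddings):
    """Return the prompt whose embedding is closest to the image embedding,
    plus the full score table for inspection."""
    scores = {prompt: cosine_similarity(image_embedding, emb)
              for prompt, emb in prompt_embeddings.items()}
    return max(scores, key=scores.get), scores

# Stand-in vectors for illustration only; a real system would obtain these
# by running the camera frame and each prompt through a text-image model.
image_vec = np.array([0.9, 0.1, 0.2])
prompts = {
    "an empty train platform": np.array([0.1, 0.9, 0.1]),
    "a person lying on a bench": np.array([0.85, 0.15, 0.25]),
}
label, scores = best_matching_prompt(image_vec, prompts)
print(label)  # the prompt most similar to the image embedding
```

In a deployed system, the prompt with the highest similarity (above some confidence threshold) determines which scenario the frame is reported as.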

One challenge of using a text-image model is that the text description will not match an image with high confidence if the image contains too many other items, so that the text describes only part of the image. This is especially a problem when using a camera with a wide field of view. To overcome this problem, EyesOnIt data scientists sliced video frames into multiple segments with some overlap. The overlap helps ensure that the object of interest is completely contained in at least one slice, and will therefore be detected with high confidence. Figure 5 shows how slicing helps both localize an object and produce a high-confidence detection when using a text-image model. When the bottom-left segment is compared to text looking for a person, the text and segment will match with much higher confidence than when the text is compared to the whole image.

Figure 5. Slicing a video frame to increase confidence of detection.
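The slicing step can be sketched as computing a grid of overlapping tile coordinates over each frame; the tile size and overlap values below are illustrative assumptions, not the production configuration:

```python
def tile_frame(width, height, tile_size, overlap):
    """Compute overlapping tile boxes (left, top, right, bottom) covering
    a frame. Overlap helps ensure an object of interest falls entirely
    inside at least one tile, so a text prompt can match it with high
    confidence."""
    step = tile_size - overlap

    def starts(extent):
        positions = list(range(0, extent - tile_size + 1, step))
        if not positions:
            positions = [0]
        # Add a final tile flush with the edge if the stride didn't reach it.
        if positions[-1] + tile_size < extent:
            positions.append(extent - tile_size)
        return positions

    return [(x, y, x + tile_size, y + tile_size)
            for y in starts(height) for x in starts(width)]

# Example: a 1920x1080 frame sliced into 640px tiles with 128px of overlap.
boxes = tile_frame(1920, 1080, 640, 128)
```

Each box can then be cropped from the frame and scored against the text prompts independently, with the highest-scoring tile giving both the detection and its approximate location.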

There were many cases where a person in the train station after hours was not considered loitering; for example, a maintenance worker picking up an item or performing repairs. To mitigate false positives of this nature, our team of data scientists designed a mechanism which only triggers an alert after high-confidence positive detections have been observed for a configurable amount of time. Once an alert has been sent, no new alerts will be triggered until a configurable number of “no loitering” detections have been made.

It is important to note that the EyesOnIt system has many “knobs” which can be tuned to satisfy the requirements of specific use cases, including the detection triggering time, the reset time, and the confidence threshold at which the software counts an event as a positive detection. Figure 6 shows a graphic of this mechanism. The specific numbers are configurable, but we have chosen 10 positive detections and 60 negative detections for purposes of illustration.

Figure 6. False positive mitigation mechanism.
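The triggering and reset logic illustrated in Figure 6 can be sketched as a small counter-based state machine. The class and parameter names below are hypothetical, and the thresholds are the illustrative values from the text (scaled down in the usage example):

```python
class LoiteringAlerter:
    """Debounced alerting: fire an alert only after `trigger_count`
    consecutive high-confidence detections, then suppress further alerts
    until `reset_count` consecutive negative detections re-arm the system."""

    def __init__(self, trigger_count=10, reset_count=60, threshold=0.8):
        self.trigger_count = trigger_count
        self.reset_count = reset_count
        self.threshold = threshold
        self.positives = 0
        self.negatives = 0
        self.alerted = False

    def update(self, confidence):
        """Process one detection result; return True when an alert fires."""
        if confidence >= self.threshold:
            self.positives += 1
            self.negatives = 0
        else:
            self.negatives += 1
            self.positives = 0
            # Re-arm only after a sustained run of "no loitering" results.
            if self.alerted and self.negatives >= self.reset_count:
                self.alerted = False
        if not self.alerted and self.positives >= self.trigger_count:
            self.alerted = True
            return True
        return False

# Smaller thresholds than Figure 6, to keep the example short:
alerter = LoiteringAlerter(trigger_count=3, reset_count=5, threshold=0.8)
stream = [0.9] * 4 + [0.1] * 5 + [0.9] * 3
results = [alerter.update(c) for c in stream]
```

In this stream, one alert fires on the third high-confidence detection, the following detections are suppressed, and after five negatives the system re-arms and fires again on the next run of three positives.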

Our solution was tested thoroughly and performed successfully on both positive cases, where a subject was present, and negative cases, where one was not. Successful detection of these conditions allowed our system to send accurate alerts without false positives.

The customer required our solution to integrate with their existing video management system (VMS). When EyesOnIt determines that a notification should be made, it sends an alert to the VMS specifying the camera and the type of event that was detected. After receiving that alert from their VMS, the customer can take the appropriate action.
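As a rough illustration of the information such an alert carries, the sketch below builds a simple payload identifying the camera and event type. The field names and format here are assumptions for illustration only; the actual schema and transport are defined by the customer's VMS integration:

```python
import json
from datetime import datetime, timezone

def make_vms_alert(camera_id, event_type, confidence):
    """Build an alert payload identifying the camera and event type.
    Field names are hypothetical; a real VMS dictates its own schema."""
    return {
        "camera": camera_id,
        "event": event_type,
        "confidence": round(confidence, 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Hypothetical camera identifier for illustration.
alert = make_vms_alert("platform-3-north", "loitering", 0.93)
print(json.dumps(alert))
```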


In the realm of train station monitoring, computer vision for loitering detection has emerged as a transformative solution. This case study highlights the distinct challenges faced by a large railroad company and the innovative approach undertaken by EyesOnIt. Traditional person detection neural networks, while readily available, proved insufficient in addressing the complexity of loitering scenarios often encountered in train stations during off hours.

EyesOnIt’s use of text-image models, trained on diverse captioned image datasets, provided a breakthrough solution by circumventing the need for manual sample preparation and exposing the models to real-world complexities. The system demonstrated its effectiveness through rigorous testing, enabling the customer to make informed decisions based on precise alerts.

The integration of EyesOnIt with the customer’s video management system further streamlined the monitoring process, offering actionable insights and setting a benchmark for the transportation industry. This case study underscores the potential of technology to enhance safety, security, and operational efficiency in train stations, highlighting the adaptability and precision of computer vision systems as a pivotal factor in shaping the future of this critical sector.


