Inferring customer occupancy status in for-hire vehicles using PU Learning

Published in the 8th ACM IKDD CODS and 26th COMAD.
Vaishnavi Muralidharan, Nandan Sudarsanam, Balaraman Ravindran

Data from Global Positioning Systems (GPS) and fare-meters in for-hire vehicles (FHVs) have been used for various applications, both in research and in organizational decision-making. The utility of such exercises largely depends on the accuracy of the data. This study looks at an environment where the data is partially mislabeled. Specifically, we take a common real-world setting where vehicle operators choose to render transportation services to customers without the use of a fare-meter, often by negotiating a fixed rate with the customer. This practice, which has been observed and documented to varying degrees across urban areas around the world, leads to various undesirable effects. In this study, we seek to identify cases of such behavior in the dataset. Typically, a supervised learning classifier could be built to predict the occupancy status from GPS traces, and its predictions could then be used to look for discrepancies between the predicted and stated behaviors. However, in our case the training dataset also contains instances of incorrect tagging. We address this problem by casting it as one of learning from Positive and Unlabeled instances (PU Learning). This is because we observe one-sided label noise: trips tagged ‘vacant’ by the taximeter could be truly vacant or occupied, whereas trips tagged ‘occupied’ are expected to be occupied in reality as well. To support this novel formulation, we apply three state-of-the-art PU Learning algorithms to a real-world trajectory dataset from an organization plying 170 active vehicles over a period of two months. We compare these to the baselines of standard supervised learning. Validation is carried out by the organization through alternate channels of investigation that are not indicated in the dataset. The results show that the PU Learners provide a significant improvement in classification across a range of metrics when compared to the baseline approaches. This translates to a significant increase in the number of mislabeled rides that are identified or reclassified.
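
To make the PU formulation concrete, the sketch below illustrates one well-known PU Learning correction, the Elkan and Noto label-frequency estimator. It is not necessarily one of the three algorithms applied in the study, which the abstract does not name. It assumes a classifier has already been trained to separate trips tagged ‘occupied’ (labeled positives) from trips tagged ‘vacant’ (unlabeled) and that it outputs a score approximating P(tagged occupied | trip). All scores, trip names, and the 0.5 threshold are hypothetical.

```python
from statistics import mean


def estimate_label_frequency(scores_on_tagged_occupied):
    """Estimate c = P(tagged 'occupied' | truly occupied).

    Elkan-Noto estimator: the average classifier score over held-out
    trips that the taximeter tagged 'occupied' (labeled positives).
    """
    return mean(scores_on_tagged_occupied)


def corrected_occupancy_probability(score, c):
    """Convert P(tagged occupied | trip) into P(truly occupied | trip).

    Under the PU assumption that tagged positives are a random sample
    of true positives, the two differ by the constant factor c.
    """
    return min(1.0, score / c)


# Hypothetical scores on held-out trips tagged 'occupied'.
holdout_occupied_scores = [0.9, 0.8, 0.7, 0.8]
c = estimate_label_frequency(holdout_occupied_scores)

# Hypothetical scores on trips tagged 'vacant' (the unlabeled set).
vacant_tagged_scores = {"trip_a": 0.10, "trip_b": 0.72, "trip_c": 0.30}

# Flag 'vacant' trips whose corrected probability suggests a passenger
# was on board (threshold 0.5 is an illustrative choice).
suspects = [
    trip
    for trip, score in vacant_tagged_scores.items()
    if corrected_occupancy_probability(score, c) >= 0.5
]
print(suspects)
```

Dividing by c matters because a classifier trained on tagged labels systematically underestimates true occupancy: some occupied trips sit in the ‘vacant’ pool, so raw scores are scaled down by the fraction of occupied trips that were actually metered.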