Art of making Map App safe
Though this article is inspired by an artist, we will discuss only the art of making map apps safe. This is my third article, and it addresses one more serious issue in Google Maps and other map applications. In case you want to see the previous articles, please find the link below:
In my view, any software has 2 kinds of bugs: actual code bugs and usage bugs. We are well aware of code bugs, and they can be fixed relatively easily, but preventing usage bugs is really difficult, and they sometimes end in disaster. Take the story of the Sony Handycam camera, launched in 1998. It included an infrared night scope, intended for filming nocturnal animals and birds. However, one Japanese magazine reported that a filter costing less than seven US dollars let Handycam users look beneath certain kinds of clothing during the daytime when the camera’s “night shot” mode was activated. You can read the details below:
Ultimately, Sony had to recall 700,000 video cameras. This falls under the second category of bugs.
I mention this story because of another piece of news, about the exploitation of the Google Maps application: an artist who painted a route in Google Maps red, not with his paintbrush but with a simple trick or hack:
Man used 99 smartphones to create fake traffic jam and fool Google Maps: Here’s how
A man used only 99 smartphones and a small cart to fool Google Maps. Simon Weckert took to Youtube to share a video in which he can be seen pulling off the trick. He took 99 second-hand phones in a small cart with Google Maps running on all of them and simply walked the streets. As he walked, Google noticed too many users and sensed the slow-moving traffic and showed that the street has too many cars and hence congested. That was it. No technical skills were involved in this whole process but just a bit of common sense
If you want to see the video and add to the views of Simon Weckert (an artist turned hacker ;)), please click here: Video
At first glance this looks hilarious, but I think it can be a real threat, and we need to take it seriously and act soon. Consider this example. After a heavy but successful production deployment, we decided on a weekend getaway. We inspected our car, packed the necessary stuff, and were ready to explore. As soon as we started the car, we lined up our favorite road trip playlist and turned on navigation to the destination. Just before reaching the destination, we saw traffic on the map. An alternate route was available, and as we had already lost some time eating at hotels along the way, we decided to take it to avoid the traffic, totally unaware that the traffic on the map was fake, created by some evil minds. On the alternate route, some looters were waiting for us. However, being a pro driver ( ;)) and an experienced one, I sensed some tension on this route and decided to move back to the original route. To my surprise, when I got back to the original route there was no traffic, just a man riding a bicycle at a very slow speed.
Though this is just an example, it is enough to explain the threat.
We use map applications when we are not fully aware of our route, or when we want to estimate the time of arrival, predict traffic, or explore nearby. Map apps are really useful and a true journey partner!!
Now that we understand the issue, it’s time to understand its characteristics, and then we will proceed to a solution. During these covid times, the 2 things I really miss from my office are 1. the coffee machine and 2. the giant whiteboards. Both help me think ideas through thoroughly. For the time being, I used my whiteboard at home to understand the issue and design a solution:
I know it’s really difficult to read my writing :) so let me write and explain things here. Before moving to a solution, let’s understand the characteristics of genuine traffic and fake traffic.
1. Characteristics of Genuine Traffic:
- There is a significant number of vehicles moving at a slow speed, such as 5–15 km/hr. What counts as slow varies from road to road. For example, on a highway where the minimum speed is defined as 50 km/hr, vehicle movement below 30 km/hr will be considered slow traffic, whereas on small roads with lots of curves a speed of 30 km/hr will be considered a traffic-free situation.
- The area in traffic is densely occupied by vehicles.
- Speed can vary between vehicles: buses and cabs may move more slowly than bikes and autos in a traffic area.
There are some other characteristics that are beyond the scope of this article, such as:
- Temperature, pollution, and audio levels will be higher than in free-flowing traffic.
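The road-dependent notion of "slow" from the first bullet above can be sketched in a few lines. This is only an illustration: the function name `is_slow` and the 0.6 ratio are assumptions (the ratio is chosen so that 30 km/hr is the cut-off on a 50 km/hr highway, as in the example), not something a real traffic engine is known to use.

```python
def is_slow(speed_kmh, free_flow_kmh, ratio=0.6):
    """Return True when a speed is below a fraction of the road's usual speed.

    free_flow_kmh is the speed considered normal for this road; ratio is an
    assumed tuning knob (0.6 * 50 km/hr = 30 km/hr, matching the highway example).
    """
    return speed_kmh < ratio * free_flow_kmh


# Highway where 50 km/hr is the expected speed: 25 km/hr counts as slow traffic.
print(is_slow(25, 50))
# Small curvy road where 30 km/hr is normal: 30 km/hr is not slow.
print(is_slow(30, 30))
```

The point is simply that the same speed reading can mean congestion on one road and free flow on another, so any fake-traffic check has to start from a per-road baseline.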
2. Characteristics of Fake Traffic:
Considering the video posted by Simon Weckert, we have the following observations about fake traffic:
- The traffic occupies a much smaller area than genuine traffic does.
- Often this area covers only a specific portion of the road, not the entire road.
- Of course, instead of lots of vehicles, lots of phones with a map app running are responsible for the traffic.
- The speed of all sources (mobile devices or other GPS data providers) is the same. This is because all sources are kept in the same trolley or box.
3. Solution
So, basically, whether traffic is fake or genuine depends on its characteristics, and by identifying these characteristics we can determine the traffic type. For us there are 3 important inputs: a) area, b) speed, and c) the magnitude of sources (sources here are mobile phones or other GPS data providers to the Maps app).
So if the area of the region occupied by the phones is small and their speed is exactly the same, it is more likely that the traffic they produce is fake. But we also need to consider some more points. For example, a bus can carry more than 100 passengers with phones that have a map application “ON”, and this is a totally genuine case. This example opens the door to our solution, i.e. to lower the magnitude of sources that are currently in the same vehicle or object.
GPS data from one mobile phone or from 100 mobile phones gives us the same details (data for traffic prediction) if they are in the same object, such as a bus or a box, because their speed and their approximate cluster boundaries (latitude and longitude, e.g. the dimensions of the bus or box) will be the same.
So we can do the following steps to determine fake traffic:
- If the Maps app algorithm predicts traffic, go on to determine the traffic type, or the truth of the traffic.
- Create clusters using k-means or another relevant and effective clustering algorithm, based on speed, time, and GPS coordinates.
- Calculate the area of each cluster. If this area is smaller than the typical area of a bus, or any other suitable reference area, then it is likely fake traffic. We can also say that this traffic will not last long.
- As a double check, we can combine this with the past history of the region and determine whether traffic at this time and in this area is plausible.
- We can also calculate the real traffic status by reducing the magnitude of sources that belong to the same cluster, considering only one source per cluster for traffic determination.
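Before turning to the data, the decision logic in the steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: the name `classify_cluster`, the thresholds, and the `usual_slow_here` flag (standing in for the history check) are all hypothetical, and the 30 square metre default is only a rough bus-sized reference area.

```python
from statistics import pstdev


def classify_cluster(area_m2, speeds_kmh, usual_slow_here=False,
                     min_area_m2=30.0, min_sources=20, speed_tol_kmh=0.5):
    """Label a slow-moving cluster of GPS sources as 'suspicious' or 'plausible'.

    All thresholds are assumed values for illustration only.
    """
    crowded = len(speeds_kmh) >= min_sources        # many sources report from here
    tiny = area_m2 < min_area_m2                    # footprint smaller than a bus
    uniform = pstdev(speeds_kmh) <= speed_tol_kmh   # everyone moves at the same speed
    # History check: a jam that is normal at this place and time is not flagged.
    if crowded and tiny and uniform and not usual_slow_here:
        return "suspicious"
    return "plausible"


# 99 phones in a cart: tiny footprint, identical speeds.
print(classify_cluster(2.0, [5.0] * 99))
# A real jam: large footprint, mixed speeds.
print(classify_cluster(400.0, [5, 8, 12, 15, 7, 9] * 5))
```

Requiring all three signals together is a deliberate choice in this sketch, since a crowded bus alone satisfies "many sources with the same speed" but not the tiny footprint.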
Let’s do coding:
# Import prerequisites.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()
Let’s plot a graph representing the signature for our initiative.
fakeTrafficDF=df=pd.read_excel("C:/Users/Documents/{{MyDocuments}}/S
/Documents/{{My Documents}}/SachinUV/Inspiring/UltraMap/gpsDataFakeTraffic.xlsx")
fakeTrafficDF.head()
Let’s understand the dataset. This is imaginary data, with data points populated to mimic an attempt to create fake traffic. GPS coordinates are obtained from the area below:
All co-ordinates are obtained from an area inside blue box.
Columns of the dataset:
- AnonymousSourceID: ID of source (Mobile phones)
- SourceLatitude&Longitude: GPS coordinates of source
- Acceleration_Trend(Km/Hour): Speed at which device is traveling(approximate)
- Date and Time: Date and time at the recorded instance
- Region: Location
As we can see in the above output snippet, the speed of all sources is the same, and so is the time. The GPS coordinates are close to each other. Some sources can have the same GPS coordinates if they are placed on top of one another; this is also possible in a double-decker bus running in Mumbai city.
Now let’s try to find clusters using k-means:
By the way, k-means is one of the simplest unsupervised learning algorithms for the well-known clustering problem. It is an iterative algorithm that divides an unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.
A good idea about k-means can be obtained from this article: k-means
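To make the iteration concrete, here is a minimal pure-Python sketch of the k-means loop (assign each point to its nearest center, then move each center to the mean of its members). It is not the scikit-learn implementation used in this article: the function name `tiny_kmeans` is hypothetical, and the farthest-point initialisation is a simple deterministic stand-in for k-means++.

```python
from math import dist


def tiny_kmeans(points, k, iterations=20):
    """Cluster a list of equal-length tuples into k groups; returns (centers, labels)."""
    # Deterministic farthest-point initialisation (an assumed stand-in for k-means++).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(centers[j])  # an empty cluster keeps its old center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centers, labels = tiny_kmeans(points, k=2)
print(labels)   # the first three points share one label, the last three the other
```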
I am using k-means for sake of simplicity because from my experience I learned that the simpler you explain, the more the learner will gain.
Let’s see code:
# X is assumed to hold the ID, speed, latitude and longitude columns of fakeTrafficDF,
# prepared the same way as X_gcf further below.
kmeans = KMeans(n_clusters=2, init='k-means++')
X['cluster_label'] = kmeans.fit_predict(X[X.columns[1:4]])  # Compute k-means clustering and label each point.
centers = kmeans.cluster_centers_  # Coordinates of cluster centers.
labels = kmeans.labels_  # Labels of each point.
X.head(100)
Here K-means params are:
- n_clusters int, optional, default 8. The number of clusters to form as well as the number of centroids to generate.
- init {‘k-means++’, ‘random’ or an ndarray}. ‘k-means++’: selects initial cluster centers for k-means clustering in a smart way to speed up convergence. ‘random’: chooses k observations (rows) at random from the data for the initial centroids. If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
This will give us 2 clusters:
K-means worked really well and gave us 2 clusters from the data points. Both clusters consist of the same sources of GPS signals, but at different geo points.
Now let’s calculate the length and breadth of these clusters, and then the area. Let’s consider only cluster 0. These steps of calculating an area from a series of GPS coordinates will really take you back to your school days, when we solved sums step by step.
Let’s calculate the length of cluster 0 first. For this we sort the SourceLongitude column and find the minimum and maximum longitude. This gives us the coordinates for the length. We do the same for the SourceLatitude column, which gives us the coordinates for the breadth. Then we put these coordinates into our distance function to calculate the length and breadth, and from these we can calculate the area.
Now here comes the unique part.
We can use 2 methods for calculating the distance between 2 GPS coordinates. One is the Haversine formula (Haversine_formula) and the other is the Vincenty distance (Vincenty_formulae). After reading certain blogs I came to know that the Vincenty distance is more accurate than the Haversine formula, because it models the Earth as an ellipsoid rather than a sphere. So let’s use an ellipsoidal distance. Note that in recent geopy versions, geopy.distance.distance computes a geodesic distance on the WGS-84 ellipsoid rather than Vincenty’s formula itself; for our purpose the difference is negligible.
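As a point of reference before using the library, the Haversine formula is short enough to write by hand. The sketch below assumes a spherical Earth with a mean radius of 6371.0088 km; the function name `haversine_km` is hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the sphere is the formula's simplification


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# One degree of longitude along the equator is roughly 111 km.
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```

Because it treats the Earth as a perfect sphere, Haversine can be off by a fraction of a percent compared with ellipsoidal methods, which is why the library distance is used for the actual measurement.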
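# Distance calculation with geopy (geopy.distance.distance is an ellipsoidal geodesic distance).
import geopy.distance
# Calculating length from the minimum and maximum longitude.
# Note: the latitude is set to 0, so the span is measured along the equator,
# which overstates the east-west distance at latitude ~19.1 by roughly 6%.
coords_1 = (0, 72.9134789977002)
coords_2 = (0, 72.91348771488344)
length = geopy.distance.distance(coords_1, coords_2).km
length_feet = length * 3280.8  # 1 km is about 3280.8 feet.
print("Length in km: ", length)
print("Length in feet: ", length_feet)
# Calculating breadth from the minimum and maximum latitude.
coords_1 = (19.11459028345533, 0)
coords_2 = (19.114610561080944, 0)
breadth = geopy.distance.distance(coords_1, coords_2).km
breadth_feet = breadth * 3280.8
print("breadth in km: ", breadth)
print("breadth in feet: ", breadth_feet)
print("Area of cluster: ", length_feet * breadth_feet)  # Area in square feet.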
Output:
Length in km: 0.0009703924004323924
Length in feet: 3.183663387338593
breadth in km: 0.002244600205765509
breadth in feet: 7.364084355075483
Area of cluster: 23.444765742526748
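The same measurement can be automated for any cluster instead of typing the extreme coordinates by hand. The sketch below is an assumption-laden alternative, not the method used above: it uses a flat-earth approximation (about 110.574 km per degree of latitude, and 111.320 km per degree of longitude scaled by the cosine of the latitude), which is reasonable only for very small areas. The function name `bounding_box_m` is hypothetical.

```python
from math import cos, radians

KM_PER_DEG_LAT = 110.574          # approximate length of one degree of latitude
KM_PER_DEG_LON_EQUATOR = 111.320  # approximate length of one degree of longitude at the equator


def bounding_box_m(points):
    """Return (width_m, height_m, area_m2) of the box enclosing (lat, lon) points."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    mid_lat = (min(lats) + max(lats)) / 2
    height_m = (max(lats) - min(lats)) * KM_PER_DEG_LAT * 1000
    # A degree of longitude shrinks with latitude, hence the cosine factor.
    width_m = (max(lons) - min(lons)) * KM_PER_DEG_LON_EQUATOR * cos(radians(mid_lat)) * 1000
    return width_m, height_m, width_m * height_m


# Extreme coordinates of cluster 0, taken from the calculation above.
cluster_0 = [
    (19.11459028345533, 72.9134789977002),
    (19.114610561080944, 72.91348771488344),
]
width_m, height_m, area_m2 = bounding_box_m(cluster_0)
print(width_m, height_m, area_m2)
```

Measuring the east-west span at the cluster's own latitude gives a slightly smaller width than measuring it along the equator, so the resulting area is a little smaller as well; either way the cluster is only a couple of square metres.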
Average city transit buses have lengths of 39'2” (11.95 m), widths of 8'4” (2.55 m), heights of 9'10” (2.99 m), and a capacity of 29 (+1) seats with standing room for 76. Details source: Source
GPS data sources can be in bulk in buses, and that’s a genuine case. However, if we calculate the footprint of an average bus (considering just two dimensions, length and breadth), it is 11.95 m * 2.55 m = 30.47 square meters, or about 328 square feet. Our cluster area (about 23.44 square feet) is far smaller than the footprint of an average bus. This shows that many GPS sources are packed into a small area. Hence this is our first indication of fake traffic.
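The unit conversion here is easy to get wrong: square metres convert to square feet by the square of the linear factor (1 m is about 3.28084 ft, so 1 square metre is about 10.7639 square feet). A quick check of the comparison, using the bus dimensions and the cluster area from above:

```python
FEET_PER_METRE = 3.28084
SQFT_PER_SQM = FEET_PER_METRE ** 2   # about 10.7639 square feet per square metre

bus_area_m2 = 11.95 * 2.55                     # footprint of an average city bus
bus_area_sqft = bus_area_m2 * SQFT_PER_SQM
cluster_area_sqft = 23.444765742526748         # cluster 0 area computed earlier

print(round(bus_area_m2, 2), "square metres")
print(round(bus_area_sqft), "square feet")
print("cluster smaller than a bus:", cluster_area_sqft < bus_area_sqft)
```

The cluster is more than ten times smaller than a bus footprint, so the conclusion does not depend on small measurement errors.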
Another major challenge is to nullify, or at least lower, the impact of bulk GPS data sources on the traffic estimate. For example, a bus with 100 passengers, of whom 25–40 are using map services, can influence the traffic algorithm a lot. If for some unforeseen reason the bus is running slow, the traffic algorithm may read this as traffic even though other vehicles are running at a good speed. So let us see how we can minimize the impact of bulk GPS data providers travelling in a single vehicle by reducing them to one signal per cluster. The idea is that whether there is one GPS data provider in a bus or many, the information is the same, because all of them travel at the same speed, so there is no sense in counting the data from every passenger. Also, one bus alone may not create traffic, and if it really is creating traffic, then other sources around it should be running slow as well.
For this let’s load another dataset that consists of inputs from more sources as well. This data is imaginary data considering the attempt of fake traffic along with inputs from normal traffic.
Now let's create clusters using k-means:
X_gcf = genuineCumFakeTrafficDF.loc[:, ['AnonymousSourceID', 'SourceLatitude&Longitude', 'Acceleration_Trend(Km/Hour)']].copy()
# Split the combined "lat,lon" text column into two numeric columns.
X_gcf[['SourceLatitude', 'SourceLongitude']] = X_gcf['SourceLatitude&Longitude'].str.split(',', expand=True).astype(float)
del X_gcf['SourceLatitude&Longitude']
kmeans = KMeans(n_clusters=5, init='k-means++')
X_gcf['cluster_label'] = kmeans.fit_predict(X_gcf[X_gcf.columns[1:4]])  # Compute k-means clustering and label each point.
centers = kmeans.cluster_centers_  # Coordinates of cluster centers.
labels = kmeans.labels_  # Labels of each point.
X_gcf.head(100)
Output:
# Let's get the unique clusters:
clusterList = X_gcf.cluster_label.unique()
print(clusterList)
[0 2 4 1 3]
So we have total of 5 clusters.
Now let’s take only one GPS source from each cluster in order to decrease the magnitude of sources.
# Pick one random source per cluster (DataFrame.append was removed in pandas 2.0, so pd.concat is used).
df_random_per_cluster = pd.concat(
    [X_gcf.loc[X_gcf['cluster_label'] == cluster_label_id].sample(n=1)
     for cluster_label_id in X_gcf.cluster_label.unique()]
)
df_random_per_cluster.head(10)
Output:
Now we will send this data back to the Map App traffic predictor to predict traffic. This should make it much harder to trick the algorithm into reporting fake traffic.
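To see why one source per cluster matters, here is a small sketch with made-up numbers: 99 phones in one cart crawling at 5 km/hr next to 5 cars moving at 40 km/hr. The data, the function names, and the use of a plain average as the "traffic estimate" are all assumptions for illustration; the sketch also uses each cluster's mean speed as its single representative, a deterministic stand-in for the random sample taken above.

```python
from collections import defaultdict
from statistics import mean


def mean_speed_all_sources(reports):
    """Average speed when every phone counts as a separate vehicle."""
    return mean(speed for _, speed in reports)


def mean_speed_one_per_cluster(reports):
    """Average speed when each cluster contributes a single representative value."""
    by_cluster = defaultdict(list)
    for cluster_id, speed in reports:
        by_cluster[cluster_id].append(speed)
    return mean(mean(speeds) for speeds in by_cluster.values())


# Hypothetical reports as (cluster_id, speed in km/hr).
reports = [("cart", 5.0)] * 99 + [(f"car{i}", 40.0) for i in range(5)]

print(round(mean_speed_all_sources(reports), 2))      # dominated by the 99 phones
print(round(mean_speed_one_per_cluster(reports), 2))  # the cart counts once
```

With every phone counted, the road looks almost stationary; with one value per cluster, the cart is just one slow object among several freely moving cars.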
In the end, I wanted to say Art and Artist both have the capability to amaze you. Till then safe driving and keep exploring.
Thank you for reading. I hope it was worth your time. Please share your feedback or insights in the comments.
Connect to me at LinkedIn !!