I am struggling to calculate the distance between multiple sets of latitude and longitude coordinates. In, short, I have found numerous tutorials that either use math or geopy. These tutorials work great when I just want to find the distance between ONE set of coordindates (or two unique locations). However, my objective is to scan a data set that has 400k combinations of origin and destination coordinates. One example of the code I have used is listed below, but it seems I am getting errors when my arrays are > 1 record. Any helpful tips would be much appreciated. Thank you.
# starting dataframe is df
lat1 = df.lat1.as_matrix()
long1 = df.long1.as_matrix()
lat2 = df.lat2.as_matrix()
long2 = df.df_long2.as_matrix()
from geopy.distance import vincenty
point1 = (lat1, long1)
point2 = (lat2, long2)
print(vincenty(point1, point2).miles)
Edit: here's a simple notebook example
A general approach, assuming that you have a DataFrame column containing points, and you want to calculate distances between all of them (If you have separate columns, first combine them into (lon, lat)
tuples, for instance). Name the new column coords
.
import pandas as pd
import numpy as np
from geopy.distance import vincenty
# assumes your DataFrame is named df, and its lon and lat columns are named lon and lat. Adjust as needed.
df['coords'] = zip(df.lat, df.lon)
# first, let's create a square DataFrame (think of it as a matrix if you like)
square = pd.DataFrame(
np.zeros(len(df) ** 2).reshape(len(df), len(df)),
index=df.index, columns=df.index)
This function looks up our 'end' coordinates from the df
DataFrame using the input column name, then applies the geopy vincenty()
function to each row in the input column, using the square.coords
column as the first argument. This works because the function is applied column-wise from right to left.
def get_distance(col):
end = df.ix[col.name]['coords']
return df['coords'].apply(vincenty, args=(end,), ellipsoid='WGS-84')
Now we're ready to calculate all the distances.
We're transposing the DataFrame (.T
), because the loc[]
method we'll be using to retrieve distances refers to index label, row label. However, our inner apply function (see above) populates a column with retrieved values
distances = square.apply(get_distance, axis=1).T
Your geopy
values are (IIRC) returned in kilometres, so you may need to convert these to whatever unit you want to use using .meters
, .miles
etc.
Something like the following should work:
def units(input_instance):
return input_instance.meters
distances_meters = distances.applymap(units)
You can now index into your distance matrix using e.g. loc[row_index, column_index]
.
You should be able to adapt the above fairly easily. You might have to adjust the apply
call in the get_distance
function to ensure you're passing the correct values to great_circle
. The pandas apply
docs might be useful, in particular with regard to passing positional arguments using args
(you'll need a recent pandas version for this to work).
This code hasn't been profiled, and there are probably much faster ways to do it, but it should be fairly quick for 400k distance calculations.
I can't remember whether geopy expects coordinates as (lon, lat)
or (lat, lon)
. I bet it's the latter (sigh).
Update Here's a working script as of May 2021.
import geopy.distance
# geopy DOES use latlon configuration
df['latlon'] = list(zip(df['lat'], df['lon']))
square = pd.DataFrame(
np.zeros((df.shape[0], df.shape[0])),
index=df.index, columns=df.index
)
# replacing distance.vicenty with distance.distance
def get_distance(col):
end = df.loc[col.name, 'latlon']
return df['latlon'].apply(geopy.distance.distance,
args=(end,),
ellipsoid='WGS-84'
)
distances = square.apply(get_distance, axis=1).T