I have a data set of geo-located social media posts for which I am trying to filter by the frequency of user_id
greater than 1 (users who posted 2 or more times). I would like to filter this so I can further clean trajectory data I'm creating.
Sample code:
# Import Data
data = pd.read_csv('path', delimiter=',', engine='python')
#print len(data),"rows"
#print data
# Create Data Fame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude'])
#print data.head()
# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)
# Get the ordered (by timestamp) coordinates for each user_id
output = [[id,data.loc[data['user_id']==id].sort_values(by='timestamp')['latitude','longitude'].values.tolist()] for id in uniqueIds]
# Save outputs
outputs = pd.DataFrame(output)
#print outputs
outputs.to_csv('path', index=False, header=False)
I tried using df[].value_counts()
to get a count of user_id, and then pass >1 in the line output = [[......data['user_id']==id>1].....
however, that did not work. Is it possible to add the frequency of user_id
as an additional argument to code and extract information for only those users?
Sample data:
user_id, timestamp, latitude, longitude
478134225, 3/12/2017 9:04, 38.8940974, -77.0276216
478103585, 3/12/2017 9:04, 38.882584, -77.1124701
478073193, 3/12/2017 9:07, 39.00027849, -77.09480086
476194185, 3/12/2017 9:14, 38.8048355, -77.0469214
476162349, 3/12/2017 9:16, 38.8940974, -77.0276216
478073193, 3/12/2017 9:05, 38.8549, -76.8752
477899275, 3/12/2017 9:08, 38.90181532, -77.03733586
477452890, 3/12/2017 9:08, 38.96117237, -76.95561893
478073193, 3/12/2017 9:05, 38.7188716, -77.1542684
Let's try to groupby
then the filter
method that returns only those records that evaluate as true in the filter function. In this case, return True for groups or user_id's that have more than one record.
df.groupby('user_id').filter(lambda x: x['user_id'].count()>1)
A more efficient statement using transform and boolean indexing.
df[df.groupby('user_id')['user_id'].transform('count') > 1]