Filter Pandas df by count of a column and write data

andrewr · Apr 8, 2017 · Viewed 7.3k times

I have a data set of geo-located social media posts that I am trying to filter by the frequency of user_id, keeping only users who posted 2 or more times. I would like to filter this so I can further clean the trajectory data I'm creating.

Sample code:

# Import libraries
import pandas as pd
import numpy as np

# Import Data
data = pd.read_csv('path', delimiter=',', engine='python')
#print len(data),"rows"
#print data

# Create Data Frame
df = pd.DataFrame(data, columns=['user_id','timestamp','latitude','longitude'])
#print data.head()

# Get a list of unique user_id values
uniqueIds = np.unique(data['user_id'].values)

# Get the ordered (by timestamp) coordinates for each user_id
output = [[id, data.loc[data['user_id']==id].sort_values(by='timestamp')[['latitude','longitude']].values.tolist()] for id in uniqueIds]

# Save outputs
outputs = pd.DataFrame(output)
#print outputs
outputs.to_csv('path', index=False, header=False)

I tried using df[].value_counts() to get a count of user_id, and then passing >1 inside the list comprehension (output = [[......data['user_id']==id>1].....); however, that did not work. Is it possible to add the frequency of user_id as an additional condition and extract information only for those users?

Sample data:

user_id, timestamp, latitude, longitude
478134225, 3/12/2017 9:04, 38.8940974, -77.0276216
478103585, 3/12/2017 9:04, 38.882584, -77.1124701
478073193, 3/12/2017 9:07, 39.00027849, -77.09480086
476194185, 3/12/2017 9:14, 38.8048355, -77.0469214
476162349, 3/12/2017 9:16, 38.8940974, -77.0276216
478073193, 3/12/2017 9:05, 38.8549, -76.8752
477899275, 3/12/2017 9:08, 38.90181532, -77.03733586
477452890, 3/12/2017 9:08, 38.96117237, -76.95561893
478073193, 3/12/2017 9:05, 38.7188716, -77.1542684

Answer

Scott Boston · Apr 8, 2017

Let's try groupby followed by the filter method, which returns only those records that evaluate as True in the filter function. In this case, return True for groups (user_ids) that have more than one record:

df.groupby('user_id').filter(lambda x: x['user_id'].count()>1)
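
With the sample data above, only user_id 478073193 appears more than once, so the filter keeps just those three rows. For what it's worth, the value_counts idea from the question can also be made to work by keeping the ids whose count exceeds 1; a sketch along the same lines:

counts = df['user_id'].value_counts()
df[df['user_id'].isin(counts[counts > 1].index)]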

A more efficient alternative uses transform and boolean indexing:

df[df.groupby('user_id')['user_id'].transform('count') > 1]
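
If helpful, here is a minimal sketch of how the filter could be combined with the trajectory-building code from the question. The file names are placeholders, and the CSV is assumed to have columns named exactly user_id, timestamp, latitude, and longitude, as in the question's code:

import pandas as pd

# Load the posts (file name is a placeholder)
df = pd.read_csv('posts.csv', usecols=['user_id', 'timestamp', 'latitude', 'longitude'])

# Parse timestamps so sorting is chronological rather than lexicographic
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Keep only users with more than one post
df = df[df.groupby('user_id')['user_id'].transform('count') > 1]

# Build one [user_id, [[lat, lon], ...]] record per remaining user, ordered by timestamp
output = [
    [uid, grp.sort_values('timestamp')[['latitude', 'longitude']].values.tolist()]
    for uid, grp in df.groupby('user_id')
]

# Save outputs (file name is a placeholder)
pd.DataFrame(output).to_csv('trajectories.csv', index=False, header=False)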