I have a log file which contains playerId values, some players have multiple entries in the file. I want to get an exact distinct count of to unique players, regardless of if they have 1 or multiple entries in the log file.
Using the query below it scans 497 records and finds 346 unique rows (346 is the number I want) Query:
fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/
| parse @message "\"playerId\": \"*\"" as playerId
| stats count(playerId) as CT by playerId
If I change my query to use count_distinct instead, I get exactly what I want. Example below:
fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/
| parse @message "\"playerId\": \"*\"" as playerId
| stats count_distinct(playerId) as CT
The problem with count_distinct however is that as the query expands to a larger timeframe/more records the number of entries get into the thousands, and tens of thousands. This presents an issue as the numbers become approximations, due to the nature of Insights count_distinct behaviour...
"Returns the number of unique values for the field. If the field has very high cardinality (contains many unique values), the value returned by count_distinct is just an approximation.".
Docs: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html
This is not acceptable, as I require exact numbers. Playing with the query a little, and sticking with count(), not count_distinct() I believe is the answer, however I've not been able to come to a single number... Examples which do not work... Any thoughts?
Ex 1:
fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/
| parse @message "\"playerId\": \"*\"" as playerId
| stats count(playerId) as CT by playerId
| stats count(*)
We are having trouble understanding the query.
To be clear, I'm looking for an exact count to be returned in a single row showing the number.
What if we introduce a dummy field that's hardcoded to "1"? The idea is to retrieve its min value so that it stays as a "1" even if the same playerId
occurs more than once. And then we sum this field.
The log entry might look like this:
[1]"playerId": "1b45b168-00ed-42fe-a977-a8553440fe1a"
Query:
fields @timestamp, @message
| sort @timestamp desc
| filter @message like /(playerId)/
| parse @message "[*]\"playerId\": \"*\"" as dummyValue, playerId
| stats sum(min(dummyValue)) by playerId as CT
References used: