I am not a #DataScience person, so I need the wisdom of the #LazyWeb to help me out, please.

(I’m running queries on #Splunk, but I don’t think this question applies to Splunk only.)

I have a report that runs hourly to calculate metrics and store them in a separate index (in Splunk terms, a “summary metrics index”) for faster querying later. It's a data roll-up. (1/4)

The metrics I’m calculating include response times as quantiles (e.g., P50, P90, P99) and total requests. I’m also storing a variety of dimensions along with the metrics, so that I can filter on them in queries against the index.

When I query the index and do a `sum(total_requests)` with a filter, I get back the correct results, because it's just a plain number: counts are additive, so summing them across rows is exact. (2/4)
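
To make the setup concrete, here's a minimal Python/pandas sketch of the shape of this roll-up (synthetic data, made-up dimension names; the real thing is an hourly Splunk search writing to a summary index). The count behaves exactly as described: a filtered sum over the summary rows matches counting the raw events:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50_000

# Toy raw events: skewed response times plus a couple of dimensions.
events = pd.DataFrame({
    "endpoint": rng.choice(["/home", "/search", "/checkout"], size=n),
    "region": rng.choice(["us", "eu"], size=n),
    "response_time": rng.lognormal(mean=3.0, sigma=1.0, size=n),
})

# The hourly roll-up: quantiles plus a request count per dimension combo.
summary = events.groupby(["endpoint", "region"])["response_time"].agg(
    p50=lambda s: s.quantile(0.50),
    p90=lambda s: s.quantile(0.90),
    p99=lambda s: s.quantile(0.99),
    total_requests="count",
).reset_index()

# Counts are additive, so a filtered sum over the summary is exact.
filtered = summary[summary["region"] == "us"]["total_requests"].sum()
assert filtered == (events["region"] == "us").sum()
```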

But… maybe you can see where this is going… when I try to `avg(p50)` (for example), the number is WILDLY off, for reasons that are probably obvious to you but weren’t immediately obvious to me. All those dimensions slice the data into much smaller groups, so each group’s quantiles are heavily influenced by outliers, and quantiles aren’t additive: averaging per-group P50s at the end doesn’t give you an accurate overall P50, especially when the groups are different sizes. (3/4)
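
Here's a self-contained sketch of that failure (synthetic lognormal latencies, with bucket sizes invented to mimic uneven traffic): the unweighted average of per-bucket P50s lands nowhere near the pooled P50:

```python
import numpy as np

rng = np.random.default_rng(1)

# Three "dimension" buckets with very different sizes and latencies,
# like real traffic sliced along many dimensions.
buckets = [
    rng.lognormal(mean=3.0, sigma=0.5, size=90_000),  # big, fast
    rng.lognormal(mean=4.5, sigma=1.0, size=9_000),   # medium, slower
    rng.lognormal(mean=6.0, sigma=1.5, size=1_000),   # tiny, outlier-heavy
]
raw = np.concatenate(buckets)

# Quantiles are not additive: each tiny bucket gets the same weight as
# the big one, so the average of P50s is dominated by the outliers.
true_p50 = np.percentile(raw, 50)
avg_of_p50s = np.mean([np.percentile(b, 50) for b in buckets])
print(f"pooled P50:          {true_p50:6.1f}")     # ~21
print(f"avg of bucket P50s:  {avg_of_p50s:6.1f}")  # ~170, wildly off
```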

@ramsey If you actually need to do it this way (i.e., calculate from imperfect data) and, along with the stats, you’ve stored the number of items in each bucket (parameter set), you can do a weighted average to emphasize the larger buckets; that should be closer to the actual result.
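
A quick sketch of that weighted-average idea on the same synthetic buckets as above (assuming the stored per-bucket count is `total_requests`):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same uneven synthetic buckets as before.
buckets = [
    rng.lognormal(mean=3.0, sigma=0.5, size=90_000),
    rng.lognormal(mean=4.5, sigma=1.0, size=9_000),
    rng.lognormal(mean=6.0, sigma=1.5, size=1_000),
]
raw = np.concatenate(buckets)

p50s = np.array([np.percentile(b, 50) for b in buckets])
counts = np.array([len(b) for b in buckets])  # the stored total_requests

# Weight each bucket's P50 by its request count.
weighted_p50 = np.average(p50s, weights=counts)

print(f"pooled P50:         {np.percentile(raw, 50):6.1f}")  # ~21
print(f"unweighted average: {p50s.mean():6.1f}")             # ~170
print(f"count-weighted avg: {weighted_p50:6.1f}")            # ~30, closer
```

It's still an estimate rather than the true quantile, since quantiles don't combine exactly from summaries, but with the counts already stored it's a cheap improvement over a plain `avg(p50)`.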
