Tuning memory-bound code is complete witchcraft. If the selected loop blocking size is small, it reduces DRAM traffic thanks to cache reuse, but now control overhead may stall the pipeline. If the block size is medium, it increases DRAM traffic instead, because it just thrashes the cache. If the block size is large, DRAM traffic returns to the baseline level and you may as well do nothing. There's simply no way to win. :woozy_baa:
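
For readers unfamiliar with loop blocking, here is a minimal sketch in C of the kind of knob being tuned. It is not the code from the post: the matrix transpose kernel and the names `transpose_naive`, `transpose_blocked`, and `block` are assumptions chosen purely for illustration. The blocked version walks the matrix in `block` x `block` tiles so that each tile's source and destination lines can stay in cache while they are reused; `block` is the parameter whose small, medium, and large settings the post describes.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: dst (cols x rows) becomes the transpose of src (rows x cols).
 * Reads are contiguous, writes stride through dst by `rows` elements. */
void transpose_naive(const double *src, double *dst,
                     size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[j * rows + i] = src[i * cols + j];
}

/* Blocked variant: same result, but computed tile by tile.
 * `block` is the tile edge length (hypothetical tuning parameter).
 * Smaller tiles add more outer-loop iterations and bounds computations;
 * larger tiles touch more distinct cache lines per tile. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols, size_t block)
{
    assert(block > 0);
    for (size_t ii = 0; ii < rows; ii += block) {
        /* Clamp the tile so partial tiles at the edges are handled. */
        size_t i_end = (ii + block < rows) ? ii + block : rows;
        for (size_t jj = 0; jj < cols; jj += block) {
            size_t j_end = (jj + block < cols) ? jj + block : cols;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

Both functions produce identical output for any positive `block`; only the memory access order differs. Which tile size is fastest depends on the cache hierarchy and the kernel, so it generally has to be measured rather than derived.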

@niconiconi what exactly do you mean by control overhead? The literal flops/iops/memory bw spent to decide whether to terminate the loop, or something like branch mispredictions in the same, or something totally different?
