Greenland et al. (2016, doi.org/10.1007/s10654-016-014 ) suggested that a more refined goal for statistics than testing study hypotheses is evaluating the (un)certainty of effect sizes.

I agree. The majority of studies are exploratory, hence testing hypotheses does not make sense.

#science #statistics #papers

@raoulvanoosten Don't you think this conflates testing (distinguishing signal from noise) and testing hypotheses (confirming or falsifying theoretical predictions)? Even if you do not want to do the second, you often want to do the first, no?

@lakens do you mean estimation of ES (un)certainty is enough both to distinguish signal from noise and to test hypotheses?

@raoulvanoosten That is exactly the discussion - are you happy just estimating an effect size, and discussing it, even if that effect size can be noise? If so, then you do not need to test. But in practice, this is not what I see in papers - even in estimation papers. People often declare something 'an effect', not just 'an estimate'.

@lakens I'm not sure yet. I think significance testing (with p < .05) is a bad idea, while a priori determination of when claims will be accepted or refuted is essential. I am undecided whether ES and CIs are enough (I have not come across other frequentist methods yet).

@raoulvanoosten I think it is currently fashionable to think p < .05 is bad, while almost everyone agrees it is a useful tool that, like all statistics, is often used without sufficient training. The difference is important. I have never seen anyone provide a coherent view on making claims without error control.

@lakens it is definitely fashionable to criticize NHST and the default p < .05. I think the good thing about that is that researchers shouldn't use these defaults without thinking about them. Which is mostly about education, like you say.

@raoulvanoosten I agree. And I think it is good to show how to do it better in practice. This step does not happen a lot - that was the whole point of the 'moving beyond p < 0.05' special issue - and even in that special issue, most solutions STILL recommended p-values, but proposed to add some additional statistic.

@lakens but the editors of that special issue suggest "don't use statistical significance". That's also my current standpoint: p-values are fine as a continuous metric, and more education is needed so people use them correctly.

Hypothesis testing needs more, and I'm not yet sure what.

@raoulvanoosten The editors of that special issue are, regrettably, incompetent, biased, unscientific individuals who, with their very bad editorial, led to a taskforce that had to correct their mistakes ( doi.org/10.1080/09332480.2021. ) so that people like you would not be misguided by them. So I am very sorry, but I will just ignore that editorial altogether and listen to more competent people.

@lakens subtle :P. I did not know this. I'll check out that paper. Thanks.

@raoulvanoosten The emotion is strong. I am writing a blog about this (which I rarely do anymore). They also lie that most of the articles in the special issue agree with their view. Not at all true. So unscientific. So misleading.

@lakens aw man, that sucks. Thanks for the Task Force paper. It seems I should check out all papers in the special issue myself, too.

@raoulvanoosten I added a short review in our revised version of psyarxiv.com/af9by/. The screenshots are the relevant paragraphs. I wanted to write a blog about it to discuss this in more detail (we are very brief in the paper).

@lakens so using interval hypothesis tests (and meaningful effect sizes)? When I started reading about the NHST issue about a year ago (with papers like Meehl, 1967), that's what I thought, but I couldn't find concrete examples and ended up with the "abandon significance" idea. Good to hear there is a body of work that supports the use of interval hypotheses.

@raoulvanoosten @lakens
The problem with confidence intervals is that they are random variables, just as p values are. That means that when you calculate a CI for your experiment, it is not right to say there's a 95% chance that the true value is included in that CI. A repeat of the experiment will give a different CI.
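A minimal simulation sketch of this point, with illustrative numbers not from the thread: each repeat of the "experiment" yields a different interval, and the 95% refers to how often the procedure's intervals cover the true value over many repeats.

```python
# Minimal sketch: simulate repeated experiments to show that the 95% CI
# varies from sample to sample, while ~95% of intervals cover the fixed
# true mean ("coverage" is a property of the procedure, not of one CI).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sigma, n, reps = 0.5, 1.0, 30, 10_000

covered = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sigma, n)
    m, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=m, scale=se)
    covered += lo <= true_mean <= hi

print(f"coverage over {reps} repeats: {covered / reps:.3f}")  # ~0.95
```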

@david_colquhoun so that's what Greenland et al. (2016) meant. So are they still useful?

@raoulvanoosten
Are CIs still useful? The problem is that people tie themselves in knots when trying to define how they should be interpreted, just as they do with p values. That's why I love the Wagenmakers quotation. I've advocated supplementing the p value and CI (rather than abandoning them) with a likelihood ratio or some measure of false positive risk.

@david_colquhoun @lakens I think indeed frequentist inference tries to answer the wrong question ("what is the probability of finding these data under the tested hypothesis?" rather than "what is the probability my tested hypothesis is true?"). Your false positive risk ( doi.org/10.1098/rsos.171085 ) is much closer to that question.

@raoulvanoosten @david_colquhoun just a reminder that we can never answer the question 'what is the probability my hypothesis is true' in science. Frequentist statistics is not answering the wrong question. It is answering the right, and only, question.

@lakens @raoulvanoosten
The problem with p values is that they confuse two quite different quantities.
The probability that you have 4 legs given that you are a cow is high.
The probability that you are a cow, given that you have 4 legs is low.
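A worked version of the cow example with made-up numbers, just to illustrate Bayes' rule: the two conditional probabilities can point in opposite directions.

```python
# Minimal sketch with invented numbers: Bayes' rule shows why
# P(4 legs | cow) being high does not make P(cow | 4 legs) high.
p_cow = 0.01                  # assumed prior: 1% of animals are cows
p_legs_given_cow = 0.99       # assumed: almost all cows have 4 legs
p_legs_given_not_cow = 0.80   # assumed: many non-cows also have 4 legs

p_legs = p_legs_given_cow * p_cow + p_legs_given_not_cow * (1 - p_cow)
p_cow_given_legs = p_legs_given_cow * p_cow / p_legs
print(f"P(cow | 4 legs) = {p_cow_given_legs:.3f}")  # ~0.012: still low
```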

@david_colquhoun @raoulvanoosten Except, this is not how p-values are used in practice. Their use is:

If this is a cow, it should have 4 legs. I perform a study. The data I have allow me to reject (with a small error rate) that the animal has 3 or fewer, or 5 or more, legs. Hence, I will act as if the animal has 4 legs.

The probability that it is a cow is not something any single study can quantify. I also find attempts to quantify that probability rather uninteresting.
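A minimal sketch of the interval test described here, using two one-sided tests against assumed bounds of 3.5 and 4.5 legs; the data and bounds are invented for illustration.

```python
# Minimal sketch: a two one-sided tests (TOST) style interval test,
# rejecting "mean <= 3.5" and "mean >= 4.5" so that we can act as if
# the mean leg count is about 4.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
legs = rng.normal(4.0, 0.3, size=25)        # assumed measurements
low, high = 3.5, 4.5                        # assumed interval hypothesis

m, se = legs.mean(), legs.std(ddof=1) / np.sqrt(len(legs))
t_low = (m - low) / se                      # test against the lower bound
t_high = (m - high) / se                    # test against the upper bound
p_low = 1 - stats.t.cdf(t_low, df=len(legs) - 1)    # H0: mean <= low
p_high = stats.t.cdf(t_high, df=len(legs) - 1)      # H0: mean >= high

print(f"p(lower) = {p_low:.4f}, p(upper) = {p_high:.4f}")
# If both are below alpha, reject all values outside (3.5, 4.5).
```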

@lakens @raoulvanoosten
Even without Bayes, surely it would be better to use a likelihood ratio, because that's the way to quantify the evidence from your experiment, e.g.
P(obs | cow) / P(obs | not cow)
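A small illustration of such a likelihood ratio, with assumed Gaussian models for the observation under "cow" and "not cow"; the numbers are invented.

```python
# Minimal sketch: a likelihood ratio comparing two simple hypotheses
# about an observed leg count, under assumed sampling models.
from scipy import stats

obs = 4.0
lik_cow = stats.norm.pdf(obs, loc=4.0, scale=0.2)       # "cow" model
lik_not_cow = stats.norm.pdf(obs, loc=3.0, scale=1.5)   # "not cow" model

lr = lik_cow / lik_not_cow
print(f"LR = P(obs | cow) / P(obs | not cow) = {lr:.2f}")
```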

@david_colquhoun @raoulvanoosten I don't think it is better - it gives additional information. Just like also computing the effect size gives additional information. I am not against it - but it is like asking me if I should buy bread, or cheese, for lunch. I'd probably prefer both :)

@lakens @raoulvanoosten
Yes, but the information given by the LR contradicts that from the p value (at least in cases where it's sensible to test a point null). That surely means that you have to decide which (LR or p) is more sensible?
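One way to see the tension, with an assumed setup: a result that is just significant at p ≈ .05 corresponds to at most a modest likelihood ratio against a point null, even when the point alternative is placed exactly at the observed value (the most favourable case for the alternative).

```python
# Minimal sketch: z = 1.96 gives p ~= .05, yet the likelihood ratio of a
# point alternative (placed at the observed value) over the point null
# is only about 7 - the disagreement being discussed here.
from scipy import stats

z = 1.96
p_two_sided = 2 * (1 - stats.norm.cdf(z))   # ~0.05

lik_null = stats.norm.pdf(z, loc=0)         # likelihood under H0: effect = 0
lik_alt = stats.norm.pdf(z, loc=z)          # best-supported point alternative
lr = lik_alt / lik_null

print(f"p = {p_two_sided:.3f}, LR(alt vs null) = {lr:.1f}")  # ~0.050, ~6.8
```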

@david_colquhoun @lakens

The Fisherian approach was introduced in 1925, while Jerzy Neyman proposed his solution in 1926.
It's amazing that after almost 100 years, there are still lively discussions about them!

In my opinion, these are two different, not always competing tools, like screws and nails.
If we have a specified hypothesis to test, it is better to use the Neyman approach and calculate an LR or another suitable test, with satisfactory α and β.

But when we just want to test a null hypothesis for error detection, or do basic exploratory analysis, a p-value seems to be a good approach.
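A minimal sketch of the Neyman-style planning step mentioned here: fixing α and β in advance and deriving the sample size for an assumed standardized effect, using a normal approximation with illustrative numbers.

```python
# Minimal sketch (normal approximation): per-group sample size for a
# two-sample comparison once alpha, beta, and a standardized effect size
# d are specified in advance.
import math
from scipy import stats

def n_per_group(d, alpha=0.05, beta=0.20):
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(1 - beta)
    return math.ceil(2 * (z_a + z_b) ** 2 / d ** 2)

print(n_per_group(d=0.5))  # ~63 per group for 80% power at alpha = .05
```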

@plenartowicz @lakens
I can't agree with your last sentence. p-values are over-optimistic (at least in cases where it's sensible to test a point null) - that's been well understood since Berger & Sellke 1987. See e.g. royalsocietypublishing.org/doi
and
tandfonline.com/doi/full/10.10
and Benjamin & Berger 2019
tandfonline.com/doi/full/10.10
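A sketch of the kind of calibration these papers discuss: the -e·p·ln(p) lower bound on the Bayes factor for the null (Sellke, Bayarri & Berger), and the minimum false positive risk it implies under assumed 50:50 prior odds. Illustrative only.

```python
# Minimal sketch: the -e * p * ln(p) bound on the Bayes factor in favour
# of H0, and the implied minimum false positive risk at 50:50 prior odds.
import math

def min_false_positive_risk(p, prior_h1=0.5):
    assert 0 < p < 1 / math.e
    bf_h0_lower = -math.e * p * math.log(p)        # lower bound on BF for H0
    prior_odds_h1 = prior_h1 / (1 - prior_h1)
    post_odds_h1_upper = prior_odds_h1 / bf_h0_lower
    p_h1_upper = post_odds_h1_upper / (1 + post_odds_h1_upper)
    return 1 - p_h1_upper                           # lower bound on P(H0 | data)

print(f"p = 0.05 -> FPR >= {min_false_positive_risk(0.05):.2f}")  # ~0.29
```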

That's why Robert Matthews said:


@david_colquhoun @lakens

I'm quite sure that I agree with you. But, in my opinion, it's more a problem of a lack of rigour and of putting too much confidence in the "statistical ritual" than of the p-value itself. Let me give an example:

1) We use p-values to detect candidate genes for some traits, like depression, anxiety, intelligence...

2) We then conduct a meta-analysis of such analyses to make sure the candidate gene has an effect, obtain a "satisfyingly low p-value" like p = .00002 (> 4σ), and based on these results conclude that we have strong evidence for the effect.

Point 2) is wrong, and there is little evidence in it, but I think that using p-values to detect "candidates" is defensible. Or, at least, I'm not aware of any better method.
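One common way to keep using p-values as a screening tool for candidates while controlling the false discovery rate is the Benjamini-Hochberg procedure; a minimal sketch with simulated p-values, not real genetic data.

```python
# Minimal sketch: screen "candidate genes" with p-values while controlling
# the false discovery rate via the Benjamini-Hochberg procedure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_genes, n_true = 10_000, 50
z = rng.normal(0, 1, n_genes)
z[:n_true] += 4.0                               # assumed: 50 genes with real effects
pvals = 2 * (1 - stats.norm.cdf(np.abs(z)))

def benjamini_hochberg(p, q=0.05):
    order = np.argsort(p)
    ranked = p[order]
    thresh = q * np.arange(1, len(p) + 1) / len(p)
    below = np.nonzero(ranked <= thresh)[0]
    if below.size == 0:
        return np.zeros(len(p), dtype=bool)
    cutoff = ranked[below.max()]                # largest p meeting its threshold
    return p <= cutoff

selected = benjamini_hochberg(pvals, q=0.05)
print(f"candidates selected: {selected.sum()} (true effects: {n_true})")
```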

@plenartowicz @lakens
A lot of work has gone into genome analyses - they are a good example of the failure of p-values. But I'm not at all an expert in that area, so I'm sorry, but I can't point you to better methods.
