106
$\begingroup$

I recently asked a question regarding general principles around reviewing statistics in papers. What I would now like to ask is: what particularly irritates you when reviewing a paper, i.e. what's the best way to really annoy a statistical referee!

One example per answer, please.

$\endgroup$
2
  • $\begingroup$ Does it extend to justifications received in response to an initial review (where minor and/or major revisions were asked)? $\endgroup$
    – chl
    Commented Oct 20, 2010 at 19:14
  • $\begingroup$ @chl: Yes, why not. $\endgroup$ Commented Oct 20, 2010 at 19:19

19 Answers

73
$\begingroup$

What particularly irritates me personally is people who clearly used user-written packages for statistical software but don't cite them properly, or at all, thereby failing to give any credit to the authors. Doing so is particularly important when the authors are in academia and their jobs depend on publishing papers that get cited. (Perhaps I should add that, in my field, many of the culprits are not statisticians.)
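In R, for example, the reference the package authors themselves want you to use is one call away (a minimal illustration; ggplot2 stands in for whatever package was actually used):

    citation("ggplot2")             # the citation requested by the package authors
    toBibtex(citation("ggplot2"))   # same entry, formatted for a BibTeX/LaTeX workflow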

$\endgroup$
4
  • 2
    $\begingroup$ +1 for me. This frustrates me, especially when they cite the wrong thing and I've provided the relevant details on how to cite the packages $\endgroup$ Commented Oct 20, 2010 at 19:36
  • 3
    $\begingroup$ Question: when citing a package, do you cite the vignette (if one exists) or the package itself? $\endgroup$ Commented Nov 24, 2010 at 6:55
  • 7
    $\begingroup$ @Brandon: if the package author cares enough to guide you, then they have given the answer in a form that will be picked up by citation("some_package") $\endgroup$
    – Ben Bolker
    Commented Dec 14, 2010 at 15:38
  • 2
    $\begingroup$ Aside from having a landmark paper, which is not so easy to do, the easiest way to get citations is to leave at least one error in your paper. Then you can publish a correction, which cites the original paper. Leave an error in the correction, and you can publish a correction which references the original correction and the original paper (I saw such a thing as a 1st year grad student). The number of citations grows as an O(N^2) process, where N is the number of corrections. $\endgroup$ Commented Oct 6, 2015 at 21:25
70
$\begingroup$

Goodness me, so many things come to mind...

  • Stepwise regression (see the simulation sketched after this list)

  • Splitting continuous data into groups

  • Giving p-values but no measure of effect size

  • Describing data using the mean and the standard deviation without indicating whether the data were more-or-less symmetric and unimodal

  • Figures without clear captions (are those error bars standard errors of the mean, or standard deviations within groups, or what?)
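On the first point, here is a minimal simulation in base R (everything below is made up for the illustration): with nothing but noise predictors, stepwise selection still returns a "final" model whose retained terms tend to look convincingly significant, because the same data that chose the terms are then used to test them.

    # 100 observations, 20 pure-noise predictors, y unrelated to all of them
    set.seed(1)
    d <- data.frame(y = rnorm(100), matrix(rnorm(100 * 20), ncol = 20))

    full <- lm(y ~ ., data = d)
    sel  <- step(full, trace = 0)    # backward stepwise selection by AIC

    # p-values of the surviving predictors are biased towards significance
    summary(sel)$coefficients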

$\endgroup$
8
  • 5
    $\begingroup$ I'm a little curious about the stepwise regression bullet. What makes stepwise regression so bad? Is it the data dredging and multiple comparisons issue? $\endgroup$ Commented Oct 20, 2010 at 19:56
  • 18
    $\begingroup$ The problem is that stepwise procedures completely invalidate all the assumptions and preconditions for "normal" inferential statistics based on p values, which are then badly biased (downwards, towards being "more significant"). So basically, the answer is "yes", with the caveat that one could in principle correct for all these multiple comparisons (something I have never seen done). I believe strongly that this is the single most important reason why I see so much research in psychology that cannot be replicated - which in turn leads to a huge waste of resources. $\endgroup$ Commented Oct 20, 2010 at 20:04
  • 10
    $\begingroup$ @Stephan: I agree, stepwise is a bad idea. Though they may not have made it to psych methods yet, there are a variety of selection procedures that correct for the bias related to overfitting by adjusting estimates and standard errors. This is not typically thought of as an issue of multiple comparisons. They are known as shrinkage methods. See my response in this thread <stats.stackexchange.com/questions/499/…> and Harrell's "Regression Modeling Strategies" or Tibshirani on the lasso. $\endgroup$
    – Brett
    Commented Oct 20, 2010 at 20:34
  • 5
    $\begingroup$ @Brett Magill: +1 on that, and yes, I know about shrinkage and the lasso. Now all I need is some way to convince psychologists that these make sense... but people have been fighting with very limited success just to get psychologists to report confidence intervals, so I'm not too optimistic about psychologists' accepting shrinkage in the next twenty years. $\endgroup$ Commented Oct 20, 2010 at 20:42
  • 10
    $\begingroup$ I'd also argue that in psychology maximising prediction is not typically the theoretical aim, yet stepwise regression is all about maximising prediction, albeit in a quasi-parsimonious way. Thus, there is typically a disconnect between procedure and question. $\endgroup$ Commented Oct 21, 2010 at 6:28
41
$\begingroup$

Irene Stratton and a colleague published a short paper on a closely related question:

Stratton IM, Neil A. How to ensure your paper is rejected by the statistical reviewer. Diabetic Medicine 2005; 22(4):371-373.

$\endgroup$
2
32
$\begingroup$

The code used to generate the simulated results is not provided, and when I do ask for it, it takes additional work to get it running on a referee-generated dataset.

$\endgroup$
1
  • 3
    $\begingroup$ And it's poorly formatted, uncommented, and uses indecipherable variable and function names. Ooooh yeah. $\endgroup$
    – naught101
    Commented Apr 18, 2012 at 10:26
31
$\begingroup$

Plagiarism (theoretical or methodological). My first review was in fact of a paper featuring many unattributed copy/paste passages from a well-established methodological paper published 10 years ago.

Just found a couple of interesting papers on this topic: Authorship and plagiarism in science.

In the same vein, I find falsification (of data or results) the worst of all.

$\endgroup$
4
  • 20
    $\begingroup$ Reminds me that in my early days as a referee I spent far too long reviewing a statistical paper that was eventually rejected by that particular journal, but the other referees and I suggested a more useful application for the method, and I also sketched an algebraic proof to replace an unsatisfactory simulation study in the manuscript. The authors have since got two published papers out of it. I'm not annoyed by that, but an acknowledgement such as "we thank referees of an earlier version of the paper for helpful comments" would have been good manners. $\endgroup$
    – onestop
    Commented Oct 20, 2010 at 21:23
  • 1
    $\begingroup$ @onestop Yes, I can imagine how disappointing such a situation might be... $\endgroup$
    – chl
    Commented Oct 20, 2010 at 21:56
  • 24
    $\begingroup$ A few weeks ago I was given a paper to review and found that 85% of it had been published in another journal...by the same authors. That, too, is still considered plagiarism. For the last several years I have routinely submitted chunks of papers--especially abstracts, introductions, and conclusions--to Web search engines before doing any review. I want to be sure the work is original before I invest any time in reading it. $\endgroup$
    – whuber
    Commented Oct 21, 2010 at 4:40
  • 7
    $\begingroup$ +1, @whuber. As an editor of a methodological journal, I often have the tough job of figuring out whether a contribution (as a rule, from well-established authors; the younger authors have not all gotten to that trajectory yet) warrants publication given that all they've done is reassemble, in a different way, the eight Lego blocks that comprised their previous five papers. This leads me to question the contribution of the preceding fifty papers these authors published, too :(. $\endgroup$
    – StasK
    Commented Apr 10, 2012 at 16:46
26
$\begingroup$

When we ask the authors for

  1. a minor comment about an idea we have (not considered, in this sense, a reason for rejecting the paper, but just a check that the authors are able to discuss another point of view), or
  2. clarification of unclear or contradictory results,

and the authors don't really answer in case (1), or the offending results in case (2) simply disappear from the manuscript.

$\endgroup$
2
  • 8
    $\begingroup$ Mysteriously disappearing results should be automatic rejection, imo. I'm sure this happens a lot "behind the scenes" (i.e. before the paper is submitted), but this is clear evidence of "cherry picking" that normal readers of the paper would never know. $\endgroup$
    – Macro
    Commented Apr 13, 2012 at 13:10
  • 3
    $\begingroup$ Another reason for an open peer review system. $\endgroup$
    – fmark
    Commented Mar 3, 2014 at 9:12
24
$\begingroup$

Confusing p-values and effect size (i.e. claiming that an effect is large because the p-value is really tiny).

This is slightly different from Stephan's point about giving p-values without any measure of effect size. I agree you should give both (and hopefully understand the difference!).
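A toy illustration in R (the numbers are invented for the example): with a large enough sample, a negligible difference produces a minuscule p-value, so the p-value on its own says nothing about how big the effect is.

    set.seed(1)
    n <- 1e6
    x <- rnorm(n, mean = 0.01)   # true difference: 1% of a standard deviation
    y <- rnorm(n, mean = 0)

    t.test(x, y)$p.value                                # minuscule p-value ...
    (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)   # ... but Cohen's d of roughly 0.01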

$\endgroup$
23
$\begingroup$

Not including effect sizes.

P-ing all over the research (I have to credit my favorite grad school professor for that line).

Giving a preposterous number of digits (males gained 3.102019 pounds more than females)

Not including page numbers (that makes it harder to review)

Misnumbering figures and tables

(as already mentioned - stepwise and categorizing continuous variables)

$\endgroup$
1
  • 8
    $\begingroup$ (+1) laughed out loud at "Giving a preposterous number of digits (males gained 3.102019 pounds more than females)". $\endgroup$
    – Macro
    Commented Apr 13, 2012 at 13:04
19
$\begingroup$

When they don't sufficiently explain their analysis, and/or make simple errors that make it difficult to work out what was actually done. This often includes throwing around a lot of jargon by way of explanation, jargon that is more ambiguous than the author seems to realize and that may also be misused.

$\endgroup$
2
  • $\begingroup$ Agree -- struggling to understand what the author(s) meant before even evaluating the scientific content is really annoying. $\endgroup$
    – Laurent
    Commented Oct 22, 2010 at 17:04
  • 5
    $\begingroup$ I agree but I find it even more annoying when a reviewer tells you to omit (or move to suppl. materials) what are, realistically, very crucial details about the analysis. This problem makes it so that lots of science/social science papers that do even the most slightly complicated analysis are pretty cryptic in that regard. $\endgroup$
    – Macro
    Commented Apr 13, 2012 at 13:08
16
$\begingroup$

Using causal language to describe associations in observational data when omitted variables are almost certainly a serious concern.

$\endgroup$
2
  • 3
    $\begingroup$ I agree that researchers should understand the liabilities of observational research designs, especially those related to omitted variables, but I don't think avoiding causal language does this. See the work of Hubert Blalock, in particular his book Causal Inferences in Non-experimental Research for a more detailed argument in defense of using causal language. $\endgroup$
    – Andy W
    Commented May 22, 2011 at 21:07
  • 3
    $\begingroup$ (+1) This might be my single biggest problem with epidemiological research. $\endgroup$
    – Macro
    Commented Apr 13, 2012 at 13:06
15
$\begingroup$

Coming up with new words for existing concepts or, vice versa, using existing terms to denote something different.

Some of these terminological splits have long been settled in the literature: longitudinal data in biostatistics vs. panel data in econometrics; cause and effect indicators in sociology vs. formative and reflective indicators in psychology; etc. I still hate them, but at least you can find a few thousand references to each in their respective literatures. The most recent one is the whole strand of work on directed acyclic graphs in the causal literature: most, if not all, of the theory of identification and estimation in these models was developed by econometricians in the 1950s under the name of simultaneous equations.

A term with a double, if not triple, meaning is "robust", and the different meanings are often contradictory. "Robust" standard errors are not robust to far outliers; moreover, they are not robust against anything except the particular assumed deviation from the model, and they often have dismal small-sample performance. White's standard errors are not robust against serial or cluster correlation; "robust" standard errors in SEM are not robust against misspecification of the model structure (omitted paths or variables). Just as with the idea of null hypothesis significance testing, it is impossible to point a finger at anybody and say: "You are responsible for confusing several generations of researchers by coining a concept that does not really live up to its name."
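A small illustration of the point about "robust" standard errors, sketched in R with simulated data (the sandwich and lmtest packages are assumed to be available; all numbers are made up): White's heteroskedasticity-consistent errors do nothing about within-cluster correlation, whereas cluster-robust errors at least address it.

    library(sandwich)   # vcovHC (White/HC) and vcovCL (cluster-robust)
    library(lmtest)     # coeftest()

    set.seed(1)
    g <- rep(1:50, each = 20)                    # 50 clusters of 20 observations
    x <- rnorm(50)[g] + rnorm(1000, sd = 0.5)    # regressor varies mostly between clusters
    y <- 0.2 * x + rnorm(50)[g] + rnorm(1000)    # cluster-level shock in the error term
    fit <- lm(y ~ x)

    coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # "robust" (White) SEs: too small here
    coeftest(fit, vcov = vcovCL(fit, cluster = g))    # cluster-robust SEs: noticeably larger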

$\endgroup$
6
  • 2
    $\begingroup$ I have to admit committing both sins: I describe my data as "having a hierarchical structure" when I have levels with 1 : n relations (many measurements of each sample, multiple samples per patient). At some point I rather accidentally learned that this is called a "clustered" data structure - now I use both terms. But I still don't know how I could have found that term; I did look desperately for the word to describe my data structure... The other way round: I use techniques that are called soft classification in remote sensing. My field (chemometrics) uses the term with quite a different meaning. $\endgroup$
    – cbeleites
    Commented Apr 10, 2012 at 17:29
  • 2
    $\begingroup$ That's all fine -- you can add "multilevel" to your list of ways to refer to this structure, too. "Clustered" usually means that the observations are known to be correlated, but nobody cares to model that correlation since it is not of primary interest, and one gets away with methods that are robust to such correlation, such as GEE. What you have is something like repeated measures MANOVA. There's a Stata package gllamm that thinks about your data as multilevel/hierarchical data, but most other packages would think of multiple measurements as variables/columns, and samples as observations/rows. $\endgroup$
    – StasK
    Commented Apr 11, 2012 at 9:09
    $\begingroup$ Thanks for the input. Well, nowadays I'd of course ask here what it is called... It's not exactly repeated measurements: usually I measure a number (order of magnitude: between 10^2 and 10^4) of different spots on the sample in order to produce false-colour maps of different constituents, and each measurement already has 10^2 - 10^3 observations (wavelengths in the spectrum). Within each sample, many spectra are highly correlated, but not all: the samples are not homogeneous. ... $\endgroup$
    – cbeleites
    Commented Apr 11, 2012 at 15:47
  • 1
    $\begingroup$ ... Your description of "clustered" sounds very much like what we do. But I do take care to split by sample for validation, I say that I don't have any idea about the effective sample size (beyond that it is at least the number of real samples involved), and I sometimes show that having all those measurements of each sample actually helps with the model training. $\endgroup$
    – cbeleites
    Commented Apr 11, 2012 at 15:49
  • 1
    $\begingroup$ Interesting and challenging data, for sure. $\endgroup$
    – StasK
    Commented Apr 11, 2012 at 17:13
14
$\begingroup$

When authors use the one statistical test they know (in my field, usually a t-test or an ANOVA), ad infinitum, regardless of whether it's appropriate. I recently reviewed a paper where the authors wanted to compare a dozen different treatment groups, so they had done a two-sample t-test for every possible pair of treatments...
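For what it's worth, the more defensible alternative is nearly a one-liner even in base R (a sketch with made-up data; grp and y are placeholders):

    # twelve treatment groups, ten observations each (made-up data)
    set.seed(1)
    grp <- factor(rep(1:12, each = 10))
    y   <- rnorm(120)

    anova(lm(y ~ grp))                                  # overall test first
    pairwise.t.test(y, grp, p.adjust.method = "holm")   # then adjusted pairwise comparisons
    # or: TukeyHSD(aov(y ~ grp))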

$\endgroup$
11
$\begingroup$

Zero consideration of missing data.

Many practical applications use data with at least some missing values; this is certainly true in epidemiology. Missing data present problems for many statistical methods, including linear models, where they are often dealt with by deleting every case that has a missing value on any covariate. That is a problem unless the data are Missing Completely At Random (MCAR).

Perhaps 10 years ago it was reasonable to publish results from linear models with no further consideration of missingness. I am certainly guilty of this. However, very good advice on how to handle missing data with multiple imputation is now widely available, as are statistical packages/libraries to facilitate more appropriate analyses under more reasonable assumptions about the missingness.
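For example, in R a minimal multiple-imputation workflow takes only a few lines with the mice package (the data below are simulated placeholders, just to show the shape of the analysis):

    library(mice)   # multiple imputation by chained equations

    set.seed(1)
    d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(200)
    d$x2[runif(200) < 0.3 * pnorm(d$x1)] <- NA   # missingness depends on x1 (MAR, not MCAR)

    imp  <- mice(d, m = 5, printFlag = FALSE)    # five imputed data sets
    fits <- with(imp, lm(y ~ x1 + x2))           # fit the analysis model in each
    summary(pool(fits))                          # combine with Rubin's rules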

$\endgroup$
1
  • 1
    $\begingroup$ In the spirit of trying to educate, can you elaborate more? What do you consider consideration - admitting that it exists, or adjusting the statistical analysis in the face of it (e.g. imputation)? When applicable I try to include supp. tables of missing values by covariates of interest, but it isn't clear if this is sufficient for "consideration" by this remark. $\endgroup$
    – Andy W
    Commented Apr 3, 2013 at 16:04
9
$\begingroup$

Reporting effects that "approached significance ( p < .10 for example) and then writing about them as though they had attained significance at a more stringent and acceptable level. Running multiple Structural Equation Models that were not nested and then writing about them as though they were nested. Taking a well-established analytic strategy and presenting it as though no-one had ever thought of using it before. Perhaps this qualifies as plagiarism to the nth degree.

$\endgroup$
1
  • 1
    $\begingroup$ Maybe it's reinventing the wheel rather than plagiarism? $\endgroup$
    – gerrit
    Commented Jun 11, 2015 at 13:36
7
$\begingroup$

I recommend the following two articles:

Martin Bland:
How to Upset the Statistical Referee
This is based on a series of talks given by Martin Bland, along with data from other statistical referees (‘a convenience sample with a low response rate’). It ends with an 11-point list of ‘[h]ow to avoid upsetting the statistical referee’.

Stian Lydersen:
Statistical review: frequently given comments
This recent paper (published 2014/2015) lists the author’s 14 most common review comments, based on approx. 200 statistical reviews of scientific papers (in a particular journal). Each comment has a brief explanation of the problem and instructions on how to properly do the analysis/reporting. The list of cited references is a treasure trove of interesting papers.

$\endgroup$
1
  • $\begingroup$ The list by Lydersen is interesting. I think I disagree with a handful of them. . . $\endgroup$ Commented Oct 6, 2015 at 22:07
6
$\begingroup$

I'm most (and most frequently) annoyed by "validation" that aims at the generalization error of predictive models but where the test data are not independent of the training data (typically multiple measurements per patient in the data, with out-of-bootstrap or cross validation splitting measurements rather than patients).

Even more annoying are papers that give such flawed cross-validation results plus an independent test set that demonstrates the over-optimistic bias of the cross validation, but not a single word acknowledging that the design of the cross validation is wrong ...

(I'd be perfectly happy if the same data were presented with "we know the cross validation should split patients, but we're stuck with software that doesn't allow this; therefore we additionally tested a truly independent set of patients.")

(I'm also aware that bootstrapping = resampling with replacement usually performs better than cross validation = resampling without replacement. However, for spectroscopic data (simulated spectra and a slightly artificial model set-up, but real spectra) we found that repeated/iterated cross validation and out-of-bootstrap had similar overall uncertainty; out-of-bootstrap had more bias but less variance. For reviewing, I look at this from a very pragmatic perspective: repeated cross validation vs. out-of-bootstrap does not matter as long as many papers neither split patient-wise nor report/discuss/mention the random uncertainty due to limited test sample size.)

Besides being wrong, this also has the side effect that people who do a proper validation often have to defend why their results are so much worse than all those other results in the literature.
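The fix costs only a few lines. Here is a sketch in base R of cross validation that splits patients rather than measurements (the data, the column names and the linear model are all placeholders for whatever the paper actually uses):

    # toy data: 30 patients, 10 measurements each
    set.seed(1)
    d <- data.frame(patient = rep(1:30, each = 10), x = rnorm(300))
    d$y <- 0.5 * d$x + rnorm(30)[d$patient] + rnorm(300)   # patient-level effect

    # assign whole patients, not single measurements, to the 5 folds
    fold <- sample(rep(1:5, length.out = 30))[d$patient]

    cv_pred <- numeric(nrow(d))
    for (k in 1:5) {
      test <- fold == k
      fit  <- lm(y ~ x, data = d[!test, ])                 # placeholder model
      cv_pred[test] <- predict(fit, newdata = d[test, ])
    }
    # cv_pred now holds predictions for patients the model never saw during training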

$\endgroup$
7
  • 1
    $\begingroup$ Not sure if you meant to say this but the "optimism" bootstrap is one of the best ways to validate a model, and its training and test samples overlap. $\endgroup$ Commented Apr 12, 2012 at 3:00
  • 1
    $\begingroup$ @Frank Harrell - I'm not sure I got your point. Maybe the difficulty is that in chemometrics "validation of a predictive model" always is about performance for new, unknown, future cases (in the example: diagnosing new patients). I use out-of-bootstrap or repeated/iterated cross validation all the time. Can you explain what the advantage of having test & train sets overlapping is compared to splitting at the patient level (I assume "overlapping" means splitting measurements so test & training measurements can belong to the same patient, always talking about an inter-patient model)? $\endgroup$
    – cbeleites
    Commented Apr 12, 2012 at 15:40
    $\begingroup$ ... And yes, some points of the model validation can be answered without splitting the data into distinct test and training cases (e.g. model stability in terms of the coefficients). But even model stability with respect to the predictions should be measured using unknown patients (unknown: never appearing in the process of building the model, including any data-driven pre-processing that takes all cases into account). Actually, for a traditional quantitation in chemometrics, the validation has steps that need further, independently measured test data: ... $\endgroup$
    – cbeleites
    Commented Apr 12, 2012 at 15:48
  • $\begingroup$ good practice calls for unknown operator of the instrument and one important characteristic of the analytical method to be determined during validation is how often the calibration needs to be re-done (or showing that instrumental drift is negligible over a certain amount of time) - some authors even talk about an "abuse of resampling" that leads to a neglect of such independent test sets. $\endgroup$
    – cbeleites
    Commented Apr 12, 2012 at 15:53
  • 1
    $\begingroup$ If the equipment or measurement techniques are in need of validation, then an independent sample is required. But a common mistake is to use data splitting to try to simulate an independent validation. This is still an internal validation. To answer @cbeleites question above, the overlapped samples involved with bootstrapping will result in more accurate estimates of future model performance than data splitting in the majority of datasets one is likely to see. I have had data splitting perform poorly with n=17,000 and 0.30 event rate. $\endgroup$ Commented Apr 12, 2012 at 21:44
5
$\begingroup$

Using Microsoft Word rather than LaTeX.

$\endgroup$
4
$\begingroup$

Using "data" in a singular sense. Data ARE, they never is.

$\endgroup$
5
  • 2
    $\begingroup$ Probably a french statistician ;) $\endgroup$ Commented Apr 14, 2012 at 16:34
  • 9
    $\begingroup$ I must admit, I have recently abandoned the plural use of data after clinging to it for 10 years or so. I generally write for non-technical audiences and I was worried I was coming over pompous. The APA seem to still have a strict reading on its being plural but interestingly the Royal Statistical Society don't seem to have a particular view. There's an interesting discussion here: guardian.co.uk/news/datablog/2010/jul/16/data-plural-singular $\endgroup$ Commented May 17, 2012 at 14:11
  • 1
    $\begingroup$ I'm not an English speaker, but the problem with words like "data" or "media" in the singular is that English has borrowed many other Latin words, and you need to use all Latin words in a consistent way. What's next? "Curricula is" or "Curriculum are"? "Medium are"? If "data" is Latin, then it is plural. End of the discussion. No matter how many people want to ignore it now. $\endgroup$
    – Fran
    Commented Nov 23, 2013 at 11:02
  • $\begingroup$ Maybe I'm misusing it, but I switch between singular and plural depending on the context. $\endgroup$ Commented Oct 6, 2015 at 21:56
  • $\begingroup$ With use of the word 'datum' rare and confined to rather specialised circumstances, I think of the word 'data' as something equivalent to the word 'pack' in respect of 'wolves'. It is certainly acceptable to use the word 'pack' in the singular to describe multiple wolves. The word 'data' is gradually turning into its own collective noun... $\endgroup$ Commented Aug 11, 2016 at 5:16
3
$\begingroup$

For me, by far the worst is attributing cause without any proper causal analysis, or otherwise improper causal inference.

I also hate it when zero attention is given to how missing data were handled. I see so many papers where the authors simply perform a complete-case analysis and make no mention of whether the results are generalizable to the population with missing values, or of how the population with missing values might be systematically different from the population with complete data.

$\endgroup$
