Original Article
Data fabrication and other reasons for non-random sampling in
5087 randomised, controlled trials in anaesthetic and general
medical journals
J. B. Carlisle
Consultant, Department of Anaesthesia, Peri-operative Medicine and Intensive Care, Torbay Hospital, UK
Summary
Randomised, controlled trials have been retracted after publication because of data fabrication and inadequate ethical
approval. Fabricated data have included baseline variables, for instance, age, height or weight. Statistical tests can
determine the probability of the distribution of means, given their standard deviation and the number of participants
in each group. Randomised, controlled trials have been retracted after the data distributions have been calculated as
improbable. Most retracted trials have been written by anaesthetists and published by specialist anaesthetic journals. I
wanted to explore whether the distribution of baseline data in trials was consistent with the expected distribution. I
wanted to determine whether trials retracted after publication had distributions different to trials that have not been
retracted. I wanted to determine whether data distributions in trials published in specialist anaesthetic journals have
been different to distributions in non-specialist medical journals. I analysed the distribution of 72,261 means of 29,789
variables in 5087 randomised, controlled trials published in eight journals between January 2000 and December 2015:
Anaesthesia (399); Anesthesia and Analgesia (1288); Anesthesiology (541); British Journal of Anaesthesia (618); Cana-
dian Journal of Anesthesia (384); European Journal of Anaesthesiology (404); Journal of the American Medical Associa-
tion (518) and New England Journal of Medicine (935). I chose these journals as I had electronic access to the full text.
Trial p values were distorted by an excess of baseline means that were similar and an excess that were dissimilar: 763/
5015 (15.2%) trials that had not been retracted from publication had p values that were within 0.05 of 0 or 1 (expected
10%), that is, a 5.2% excess, p = 1.2 × 10⁻⁷. The p values of 31/72 (43%) trials that had been retracted after publication were within 0.05 of 0 or 1, a rate different to that for unretracted trials, p = 1.03 × 10⁻¹⁰. The difference between the distributions of these two subgroups was confirmed by comparison of their overall distributions, p = 5.3 × 10⁻¹⁵.
Each journal exhibited the same abnormal distribution of baseline means. There was no difference in distributions of
baseline means for 1453 trials in non-anaesthetic journals and 3634 trials in anaesthetic journals, p = 0.30. The rate of
retractions from JAMA and NEJM, 6/1453 or 1 in 242, was one-quarter the rate from the six anaesthetic journals, 66/
3634 or 1 in 55, relative risk (99%CI) 0.23 (0.08–0.68), p = 0.00022. A probability threshold of 1 in 10,000 identified
8/72 (11%) retracted trials (7 by Fujii et al.) and 82/5015 (1.6%) unretracted trials. Some p values were so extreme that
the baseline data could not be correct: for instance, for 43/5015 unretracted trials the probability was less than 1 in 10¹⁵ (equivalent to one drop of water in 20,000 Olympic-sized swimming pools). A probability threshold of 1 in 100
for two or more trials by the same author identified three authors of retracted trials (Boldt, Fujii and Reuben) and 21
first or corresponding authors of 65 unretracted trials. Fraud, unintentional error, correlation, stratified allocation and
poor methodology might have contributed to the excess of randomised, controlled trials with similar or dissimilar
means, a pattern that was common to all the surveyed journals. It is likely that this work will lead to the identification,
correction and retraction of hitherto unretracted randomised, controlled trials.
Anaesthesia 2017, 72, 944–952 doi:10.1111/anae.13938
Correspondence to: J. B. Carlisle
Email: john.carlisle@nhs.net
Accepted: 28 April 2017
Keywords: data error; fraud; randomised, controlled trials
This article is accompanied by an editorial by Loadsman and McCulloch, Anaesthesia 2017; 72: 931–5.
Introduction
Techniques have been developed to analyse baseline
variables, particularly the mean (SD) of continuous
variables, and these have helped to identify fabricated
data in randomised, controlled trials by Fujii et al. [1].
The general principles of these methods have been
explained elsewhere [2, 3]. The same approach has
been used to investigate trials published by Yuhji Sai-
toh, a co-author of Dr Fujii, and form a component of
the investigation of his work [4]. The technique has
recently identified systematic problems with data in 33
randomised trials by Yoshihiro Sato, who is not an
anaesthetist [5].
Fujii features top in a list of biomedical authors
with the most retractions, and, for quite separate rea-
sons from those based on statistical data analysis, two
other anaesthetists appear in this list: second (Boldt)
and fifteenth (Reuben) [6]. Several questions arise. Are
trials published by anaesthetists more likely to be
retracted than trials from other specialists? Are anaes-
thetists more likely to generate fabricated data in tri-
als? Would the statistical methods used to discover
issues with data published by Fujii and Saitoh [1, 3]
also retrospectively find aberrations in baseline data of
trials published by authors like Boldt and Reuben?
The purpose of this survey is to assess if: (1) the
distribution of baseline means corresponded to the
expected distribution and whether discrepancies were
shared by leading non-anaesthetic vs. anaesthetic jour-
nals; (2) there was a different rate of retraction in lead-
ing non-anaesthetic vs. anaesthetic journals; and (3)
data corruption was discoverable by the new statistical
techniques in those papers/authors that had been
retracted. I used the method to detect anomalies in the
distributions of baseline variable mean (SD) from ran-
domised, controlled trials published during 15 years in
six specialist anaesthetic journals (Anaesthesia,
Anesthesia and Analgesia, Anesthesiology, the British
Journal of Anaesthesia, the Canadian Journal of Anes-
thesia and the European Journal of Anaesthesiology)
and two general medical journals (Journal of the Amer-
ican Medical Association (JAMA) and New England
Journal of Medicine (NEJM)).
Methods
I searched eight journals (to which I had electronic
access) for randomised, controlled trials published
between January 2000 and December 2015: Anaesthe-
sia; Anesthesia and Analgesia; Anesthesiology; British
Journal of Anaesthesia; Canadian Journal of Anesthe-
sia; European Journal of Anaesthesiology (2002–2012);
JAMA; and NEJM. I extracted baseline summary data
for continuous variables, reported as mean (SD) or
mean (SEM). I did not study trials for which partici-
pant allocation was not described as random, or trials
that did not report baseline continuous variables, or
those that reported a different summary measure, such
as median (IQR or range). I defined ‘baseline’ as a
variable measured before groups were exposed to the
allocated intervention, variables such as age, height,
‘baseline’ blood pressure or serum sodium concentra-
tion. I excluded variables that had been stratified. I
recorded whether the allocation sequence had been
generated in blocks, permuted or otherwise, which
could reduce the distribution of means for time-vary-
ing measurements.
The primary outcome was the distribution of p
values, calculated for differences between means, for
individual variables and when combined within trials.
I used three methods to generate p values for individ-
ual variables: independent t-test; ANOVA; and Monte
Carlo simulations [5], adjusted for the precision to
which mean (SD) were reported. The p value gener-
ated by these three methods that was closest to 0.5
was combined with the p values for other variables
from a trial. I used the sum of the z values (Stouffer’s
method) as the primary method to combine p values
for different variables within a randomised, controlled
trial. I also calculated the results of five other methods
used to combine p values into a single probability for
each trial: logit; mean; Wilkinson’s method; sum of log
(Fisher’s method); and sum. I used the Anderson–Dar-
ling test to compare the distribution of p values with
the expected uniform distribution, which interrogates
the extremes of the distribution; the Kolmogorov–
Smirnov test assesses the central distribution. I
checked the mean (SD) of trials in which one or more
p < 0.01 or > 0.99, as these would indicate excessively
narrow or excessively wide distributions. I checked
whether substitution of SEM for SD, and vice versa,
resulted in a less extreme p value, should the authors
have incorrectly labelled one for the other.
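As a rough illustration of the calculation described above, the base-R sketch below computes a per-variable p value from reported mean (SD) and group size with a Welch t-test, inverted so that similar means give values near 0 and dissimilar means give values near 1, and then combines one trial's variables with Stouffer's sum of z. It is a simplified sketch only: it omits the ANOVA and Monte Carlo routes, the adjustment for reporting precision and the selection of the p value closest to 0.5, and the example trial data are hypothetical.

```r
# One-sided, 'inverted' p value from reported summary statistics (Welch t-test):
# near 0 when group means are unusually similar, near 1 when unusually dissimilar.
p_similarity <- function(m1, sd1, n1, m2, sd2, n2) {
  v  <- sd1^2 / n1 + sd2^2 / n2
  t  <- (m1 - m2) / sqrt(v)
  df <- v^2 / ((sd1^2 / n1)^2 / (n1 - 1) + (sd2^2 / n2)^2 / (n2 - 1))
  1 - 2 * pt(-abs(t), df)          # uniform on (0, 1) under genuinely random allocation
}

# Stouffer's sum-of-z combination of the per-variable p values within one trial
stouffer <- function(p) pnorm(sum(qnorm(p)) / sqrt(length(p)))

# Hypothetical two-arm trial reporting age and weight as mean (SD), 50 participants per group
p_age    <- p_similarity(54.2, 10.1, 50, 54.3,  9.8, 50)
p_weight <- p_similarity(78.5,  8.9, 50, 78.4,  9.2, 50)
stouffer(c(p_age, p_weight))       # trial p value; values near 0 or 1 flag unusual baseline data
```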
A secondary analysis included comparison of p
values for randomised, controlled trials that had been
retracted vs. trials that had not been retracted. All
analyses were conducted in R (R Foundation for Statis-
tical Computing, Vienna, Austria), packages (function):
Anderson–Darling test, ‘goftest’ (ad.test) and ‘kSam-
ples’ (ad.test, Steel.test); ANOVA, ‘rpsychi’
(ind.oneway.second) and ‘CarletonStats’ (anovaSum-
marized); t-test, ‘BSDA’ (tsum.test); the quantile–quan-
tile plot, ‘qqtest’ (qqtest); the combination of p values,
‘metap’ (logitp, meanp, minimump, sumlog, sump,
sumz). All p values were one-sided and inverted, such
that dissimilar means generated p values near 1.
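A minimal sketch of that distributional comparison, assuming the combined trial p values have already been assembled into a vector (the placeholder below uses simulated values rather than the extracted data): goftest supplies the Anderson–Darling test named above, and base R's ks.test gives the Kolmogorov–Smirnov check.

```r
library(goftest)                    # ad.test, as listed above

set.seed(1)
trial_p <- runif(5087)              # placeholder for the combined trial p values (Appendix S1)

ad.test(trial_p, null = "punif")    # Anderson–Darling: sensitive to the tails of the distribution
ks.test(trial_p, "punif")           # Kolmogorov–Smirnov: sensitive to the centre
```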
Results
I scanned 9673 clinical trials for the random allocation
of baseline variables reported as mean (SD or SEM):
4586 were not randomised, controlled trials or did not
present unstratified baseline mean (SD or SE) data. I
therefore analysed 5087 trials, which included 72,261
means of 29,789 variables. The supplementary appen-
dices list the trials and the results of analyses
(Appendix S1) and the data that I analysed
(Appendix S2).
The distribution of 72,261 baseline means was
largely consistent with random sampling, with 5087
trial p values being contained within the 99% confi-
dence interval of the cumulative uniform distribution,
between p values of 0.15 and 0.95 (Fig. 1). However,
there were more trials than expected with baseline
means that were similar (near a p value of 0) or dis-
similar (near a p value of 1): 794/5087 (15.6%) trial p
values were within 0.05 of 0 or 1, that is, 5.6% more
than expected or 1 in 18 trials (Fig. 1 and Tables 1
and 2). Consequently, the distribution of trial p val-
ues deviated from the expected distribution,
p = 1.2 × 10⁻⁷. Each journal had the same propor-
tion of trials with extreme p values (Fig. 2). Although
the distribution of p values was not the same in all
journals, p = 0.007, there were no significant differ-
ences when one journal was tested against any other.
There was no difference in distributions of baseline
variables of 1453 trials published in non-anaesthetic
journals and 3634 trials published in anaesthetic jour-
nals, p = 0.30 (Fig. 3).
More baseline variables from 72 retracted trials
had very similar means or very dissimilar means than
5015 trials that had not been retracted, p = 5.3 × 10⁻¹⁵
(Fig. 4 and Table 1). The exclusion of retracted trials
(from 5087) did not resolve the discrepancies between
the observed and expected distribution of baseline
Figure 1 A quantile–quantile plot of p values calculated for 5087 randomised, controlled trials compared with the 99%CI for the expected uniform distribution. The point of interest is the deviation of p values in the trials from what is expected, at values less than 0.15 and more than 0.95.
Table 1 The numbers of randomised, controlled trials analysed from eight journals and the number (proportion) with p values for baseline means within 0.05 of 0 or 1. Rows give, for each journal and status, the numbers of means, variables and trials, then the number (proportion) of trials with p values in six bands of the probability of the distribution of baseline variable mean values: p < 0.001; 0.01 > p > 0.001; 0.05 > p > 0.01; 0.95 < p < 0.99; 0.99 < p < 0.999; and p > 0.999 (expected 0.1%, 0.9%, 4%, 4%, 0.9% and 0.1% of trials, respectively).

Anaesthesia
Not retracted: 4169 means; 1799 variables; 393 trials; 0; 5 (1.3%); 24 (6.0%); 26 (6.5%); 8 (2.0%); 8 (2.0%)
Retracted: 49 means; 33 variables; 6 trials; 0; 0; 0; 0; 0; 0
Total: 4218 means; 1832 variables; 399 trials; 0; 5 (1.3%); 24 (6.0%); 26 (6.5%); 8 (2.0%); 8 (2.0%)

Anesthesia and Analgesia
Not retracted: 14,556 means; 5720 variables; 1254 trials; 3 (0.2%); 26 (2.1%); 48 (3.9%); 68 (5.5%); 22 (1.8%); 25 (2.0%)
Retracted: 612 means; 238 variables; 34 trials; 5 (16%); 3 (10%); 2 (6%); 1 (3%); 2 (6%); 1 (3%)
Total: 15,168 means; 5958 variables; 1288 trials; 8 (0.6%); 29 (2.3%); 50 (3.9%); 69 (5.4%); 24 (1.9%); 26 (2.0%)

Anesthesiology
Not retracted: 9399 means; 3617 variables; 537 trials; 6 (1.1%); 8 (1.5%); 23 (4.3%); 23 (4.3%); 9 (1.7%); 14 (2.6%)
Retracted: 82 means; 28 variables; 4 trials; 1 (20%); 0; 0; 0; 0; 0
Total: 9481 means; 3645 variables; 541 trials; 7 (1.3%); 8 (1.5%); 23 (4.3%); 23 (4.3%); 9 (1.7%); 14 (2.6%)

British Journal of Anaesthesia
Not retracted: 6492 means; 2759 variables; 614 trials; 1 (0.2%); 6 (1.0%); 39 (6.3%); 28 (4.5%); 3 (0.5%); 11 (1.8%)
Retracted: 77 means; 30 variables; 4 trials; 1 (13%); 0; 0; 1 (13%); 0; 0
Total: 6569 means; 2789 variables; 618 trials; 2 (0.3%); 6 (1.0%); 39 (6.3%); 29 (4.7%); 3 (0.5%); 11 (1.8%)

Canadian Journal of Anesthesia
Not retracted: 3907 means; 1632 variables; 373 trials; 3 (0.8%); 7 (1.8%); 15 (3.9%); 16 (4.2%); 4 (1.0%); 9 (2.3%)
Retracted: 290 means; 88 variables; 11 trials; 3 (27%); 1 (9%); 2 (18%); 0; 0; 0
Total: 4197 means; 1720 variables; 384 trials; 6 (1.6%); 8 (2.1%); 17 (4.4%); 16 (4.2%); 4 (1.0%); 9 (2.3%)

European Journal of Anaesthesiology
Not retracted: 4226 means; 1811 variables; 397 trials; 1 (0.2%); 2 (0.5%); 14 (3.5%); 30 (7.6%); 9 (2.2%); 12 (3.0%)
Retracted: 94 means; 36 variables; 7 trials; 0; 0; 2 (29%); 1 (14%); 0; 0
Total: 4320 means; 1847 variables; 404 trials; 1 (0.2%); 2 (0.5%); 16 (4.0%); 31 (7.7%); 9 (2.2%); 12 (3.0%)

Journal of the American Medical Association
Not retracted: 10,717 means; 4481 variables; 513 trials; 2 (0.4%); 8 (1.6%); 27 (5.2%); 23 (4.4%); 11 (2.1%); 10 (1.9%)
Retracted: 163 means; 76 variables; 5 trials; 1 (20%); 1 (20%); 0; 0; 0; 0
Total: 10,880 means; 4557 variables; 518 trials; 3 (0.6%); 9 (1.7%); 27 (5.2%); 23 (4.4%); 11 (2.1%); 10 (1.9%)

New England Journal of Medicine
Not retracted: 17,404 means; 7429 variables; 934 trials; 5 (0.5%); 12 (1.3%); 50 (5.3%); 34 (3.6%); 13 (1.4%); 10 (1.1%)
Retracted: 24 means; 12 variables; 1 trial; 0; 0; 0; 0; 1 (100%); 0
Total: 17,428 means; 7441 variables; 935 trials; 5 (0.5%); 12 (1.3%); 50 (5.3%); 34 (3.6%); 13 (1.4%); 10 (1.1%)

Total
Not retracted: 70,873 means; 29,250 variables; 5015 trials; 22 (0.4%); 72 (1.5%); 244 (4.8%); 247 (4.9%); 77 (1.6%); 101 (2.0%)
Retracted: 1388 means; 539 variables; 72 trials; 11 (15%); 6 (8%); 7 (10%); 3 (4%); 3 (4%); 1 (1%)
Total: 72,261 means; 29,789 variables; 5087 trials; 33 (0.6%); 78 (1.5%); 251 (4.9%); 250 (4.9%); 80 (1.6%); 102 (2.0%)
means in the remaining 5015 trials (the p value
remained 1.2 × 10⁻⁷). The rate of retracted articles
from the two general medical journals (6/1453) was
one-quarter the rate from the six specialist anaesthetic
journals (66/3634), relative risk (99%CI) 0.23 (0.08–
0.68), p = 0.0002. The p values of the six trials
retracted from JAMA or NEJM were 6.3 × 10⁻⁸,
0.0097, 0.057, 0.21, 0.37 and 0.9988.
To assess if it might be possible to use probability
to determine which trials and authors to investigate, to
correct erroneous data or to retract fabricated data, I
applied different investigative probability thresholds. A
threshold of 1 in 10,000 (0.5 in 5000) would have cap-
tured 8/72 (11%) retracted trials (7 by Yoshitaka Fujii)
and 82/5015 (1.6%) trials that have not yet been cor-
rected or retracted. I supplemented this approach by
investigating authors of more than one trial for which
the probability was, arbitrarily, less than 1 in 100. This
identified trials (number) by Yoshitaka Fujii (13), Joa-
chim Boldt (3) and Scott Reuben (2). I searched
through all the trials by first author and corresponding
author and identified 21 other authors of more than
one trial (65 in total) with a probability less than 1 in
100. A more thorough but laborious method would be
to combine the probabilities calculated for all the trials
published by individuals. For instance, five trials pub-
lished in JAMA by an individual (four as correspond-
ing author) generated p values of 0.012, 0.030, 0.047,
0.20 and 0.48, that is, they would not have been identi-
fied with the first two methods described in this
paragraph. The composite distribution of all p values
from these five trials generates p = 0.0011 with the
Anderson–Darling test and p = 0.00045 with the sum
of z score statistic. Explanations other than chance
include corrupted data that are only revealed on pool-
ing data from multiple trials by the same author.
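A hedged sketch of that pooling, using only the five trial-level p values quoted above (the text describes pooling every variable-level p value from the five trials, so this coarser check will not reproduce the 0.0011 and 0.00045 quoted):

```r
library(goftest)

author_trial_p <- c(0.012, 0.030, 0.047, 0.20, 0.48)   # the five trial p values quoted above

# Composite probability that one author's trial p values are a sample from the uniform distribution
ad.test(author_trial_p, null = "punif")

# Stouffer's sum of z across the same trials
pnorm(sum(qnorm(author_trial_p)) / sqrt(length(author_trial_p)))
```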
The examination of some trials might suggest rea-
sons for p values near 0 and 1. None of the following
trials have been corrected or retracted. The trial with
the p value closest to 0 (p = 3.6 × 10⁻³⁰) – that is,
generated by similar means – was JAMA 2002; 288:
2421. The authors or editors probably labelled SD
incorrectly as SE: the SD calculated from the ‘SE’ was
too large to be plausible and analyses assuming the
‘SE’ were SD increased p to 0.90. The same explana-
tion might apply to the second-smallest p value,
7.1 × 10⁻²¹, generated by NEJM 2008; 359: 119, and JAMA 2001; 285: 1856, p = 3.3 × 10⁻¹¹. However, a
single solution does not explain extreme p values in
other papers. For instance, NEJM 2007; 356: 911
reported mean (SE) tissue plasminogen activator con-
centrations of 4.5 (0.6) and 3.2 (0.4) in groups of 59
and 61, respectively: conversion of the SE to SD (4.6
and 3.1) resulted in a p value of 0.92, which is not
particularly near 1 (or 0). However, conversion of the
‘SE’ for the 19 other variables resulted in p values
averaging 0.02 and a composite trial p value of
4.1 × 10⁻¹³. One could construct a p value for this
trial that does not suggest data corruption if one pos-
ited that the SD of 19 variables were incorrectly
Table 2 Conversion of single-sided p values within 0.005 of 0 or 1 in Table 1 to two-sided p values < 0.01 for 5015
unretracted trials. Randomised, controlled trials with the least likely distributions of baseline variables might benefit
from further investigation. Values are number (proportion).
Journal; Total; p < 0.00001; 0.0001 > p > 0.00001; 0.001 > p > 0.0001; 0.01 > p > 0.001; Total p < 0.01*
Expected %: < 0.001%; 0.009%; 0.09%; 0.9%; 1%
Anaesthesia 393 7 (1.8%) 1 (0.3%) 0 8 (2.0%) 16 (4.1%)
Anesthesia and Analgesia 1254 17 (1.4%) 3 (0.2%) 4 (0.3%) 28 (2.2%) 52 (4.1%)
Anesthesiology 537 9 (1.7%) 3 (0.6%) 7 (1.3%) 10 (1.9%) 29 (5.4%)
British Journal of Anaesthesia 614 5 (0.8%) 0 5 (0.8%) 7 (1.1%) 17 (2.8%)
Canadian Journal of Anesthesia 373 8 (2.1%) 1 (0.3%) 3 (0.8%) 9 (2.4%) 21 (5.6%)
European Journal of Anaesthesiology 397 5 (1.3%) 0 7 (1.8%) 7 (1.8%) 19 (4.8%)
Journal of the American Medical Association 513 10 (1.9%) 1 (0.2%) 1 (0.2%) 10 (1.9%) 22 (4.3%)
New England Journal of Medicine 934 6 (0.6%) 2 (0.2%) 3 (0.3%) 18 (1.9%) 29 (3.1%)
Total 5015 67 (1.3%) 11 (0.2%) 30 (0.6%) 97 (1.9%) 204 (4.1%)
*p = 0.17 for chi-squared comparison of totals between journals.
labelled SE, whereas the SE for tissue plasminogen
activator concentration were correct. Conversely, stan-
dard errors that are incorrectly labelled SD generate p
values close to 1 (using the methodology in this
paper), which might explain the p values of 1 gener-
ated for 4/11 variables in NEJM 2006; 355: 549. Some
Figure 2 A cumulative plot of ordered p values (0 to 1) for 5087 randomised, controlled trials, grouped by the journal
in which they were published. Each journal had more trials with baseline means that were similar (p value near 0) or
dissimilar (p value near 1) than expected, resulting in cumulative distributions that were inconsistent with the cumula-
tive uniform distribution: Anaesthesia, p = 1.5 × 10⁻⁶; Anesthesia and Analgesia, p = 4.7 × 10⁻⁷; Anesthesiology, p = 1.1 × 10⁻⁶; British Journal of Anaesthesia, p = 9.7 × 10⁻⁷; Canadian Journal of Anesthesia, p = 1.6 × 10⁻⁶; European Journal of Anaesthesiology, p = 1.5 × 10⁻⁶; Journal of the American Medical Association, p = 1.2 × 10⁻⁶; New England Journal of Medicine, p = 6.4 × 10⁻⁷.
journals publish p values for baseline data, which can
help determine sources of error. For instance, the
authors of NEJM 2010; 362: 790 calculated p = 0.07
for mean (SD) intelligence scores of 99.1 (16.6), 92.0
(14.5) and 100 (14.8) in groups of 155, 149 and 147,
respectively. The correct p value is 0.00000718.
Although 7 is the correct numeral it is unclear why
the authors’ p value was out by a magnitude of 10,000
or why the groups were so different for a baseline vari-
able. Similarly, NEJM 2004; 370: 2265 reported
p = 0.03 for mean (SD) central venous pressures of
9.0 (4.7) and 8.6 (4.6) in groups of 3,428 and 3,423,
respectively. The calculated p value is between 0.00037
and 0.0015, depending upon the method used.
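The recalculated value for the intelligence-score example can be checked in base R from the published summary statistics alone, using a one-way ANOVA assembled from the group means, standard deviations and sizes (the survey itself used the packages listed in the Methods):

```r
# NEJM 2010; 362: 790: mean (SD) intelligence scores and group sizes, as quoted above
m <- c(99.1, 92.0, 100)
s <- c(16.6, 14.5, 14.8)
n <- c(155, 149, 147)

grand      <- sum(n * m) / sum(n)            # weighted grand mean
ss_between <- sum(n * (m - grand)^2)
ss_within  <- sum((n - 1) * s^2)
df1        <- length(m) - 1
df2        <- sum(n) - length(m)
f_stat     <- (ss_between / df1) / (ss_within / df2)
pf(f_stat, df1, df2, lower.tail = FALSE)     # roughly 7 x 10^-6, in line with the 0.00000718 quoted above
```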
I analysed variables as independent, that is, not
correlated. Correlated variables might explain the p
value of 9.6 × 10⁻⁴ for JAMA 2004; 291: 309 that
reported nine baseline variables (p values averaging
0.24), six of which one would expect to be correlated
as they were derived from the same exercise tests. The
correlation of three histologic scores in NEJM 2002;
346: 1706 probably accounts for the p value of
0.99998, as might the correlation of three osteoarthritic
scores in NEJM 2010; 363: 1521 that generated a p
value of 0.99997. Supplementary data are not always
exposed to the same rigour as those in the main paper,
by author or editor, which might have contributed to
the p value indistinguishable from 1 generated by
NEJM 2013; 368: 1279, a p value that cannot be
explained by substitution of SD for SE or correlation.
Values of p very near 0 and 1 may be generated
by incorrect means, incorrect SDs or incorrect partici-
pant numbers. Two trials in JAMA with p values near
zero illustrate the probable unwitting replacement of
the correct numeral with another. Trial JAMA 2008;
299: 39 reported 31 baseline variables, 30 of which
generated p values in the normal range, whereas the
31st variable, mean (SD) subcutaneous fat depths of
2.6 (0.8) cm and 3.5 (0.8) cm in groups of 113 and
110, respectively, generated p = 5.5 × 10⁻¹⁵. Other
data in the paper suggest that the correct means were
2.6 cm and 2.5 cm, for which p = 0.35. Amidst 14 p
values in JAMA 2003; 289: 2215 the mean (SD) dietary
fat intakes of 37.2 (0.09) kcal and 38.2 (0.19) kcal in
groups of 230 and 220, respectively, generated
Figure 4 The expected cumulative uniform distribution was not followed by the p values for 5015 unretracted trials, p = 1.2 × 10⁻⁷, or by the p values from 72 retracted trials, p = 8.6 × 10⁻⁶. The cumulative distributions of unretracted and retracted trials were different, p = 5.3 × 10⁻¹⁵.
[Figure 3: x-axis, trial p values from six anaesthetic journals; y-axis, trial p values from two general medical journals; both 0.0 to 1.0]
Figure 3 A quantile–quantile plot for p values calcu-
lated for 1453 randomised, controlled trials published
in two non-specialist medical journals vs. 3634 ran-
domised, controlled trials published in six specialist
anaesthetic journals. The distribution is consistent with the reference unitary tangent, p = 0.30.
p < 10⁻¹⁶ (but reported as 0.60). I expect that the cor-
rect means were identical to one decimal place, proba-
bly both 37.2 or 38.2, but there might be a second
error as standard deviations in such large groups
should be similar.
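Both checks in this paragraph can be repeated from the published summary statistics; the sketch below uses an ordinary two-sided Welch t-test (not the survey's one-sided, inverted convention) for the subcutaneous-fat example:

```r
# Two-sided Welch t-test from mean, SD and group size
welch_two_sided <- function(m1, sd1, n1, m2, sd2, n2) {
  v  <- sd1^2 / n1 + sd2^2 / n2
  t  <- (m1 - m2) / sqrt(v)
  df <- v^2 / ((sd1^2 / n1)^2 / (n1 - 1) + (sd2^2 / n2)^2 / (n2 - 1))
  2 * pt(-abs(t), df)
}

# JAMA 2008; 299: 39, as quoted above
welch_two_sided(2.6, 0.8, 113, 3.5, 0.8, 110)   # extreme, of the order of 10^-15
welch_two_sided(2.6, 0.8, 113, 2.5, 0.8, 110)   # about 0.35, matching the value quoted for the probable correct means
```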
Trials are retracted for various reasons. The retrac-
tion of two trials was triggered after I noticed duplica-
tion of baseline data: EJA 2004; 21: 60, which was the
entire republication of EJA 2003; 20: 668; and Anaesthe-
sia 2010; 65: 595, which presented secondary data from
a previously published cohort without reference. Eigh-
teen trials in this survey were authored by Yoshitaka
Fujii, all of which were retracted for data fabrication, as
were eight trials by Scott Reuben. Figure 5 compares the
26 trials retracted for fabrication with the 44 trials that
were retracted for inadequate ethical approval or unclear
reasons, such as ‘misconduct’. The distributions of base-
line data were different in the two subgroups,
p = 3.3 × 10⁻⁵, but neither was at all consistent with the expected distribution, p = 1.9 × 10⁻⁵ and p = 2.3 × 10⁻⁵, respectively. Trials retracted due to
unethical practice or ill-defined reasons might therefore
also contain corrupt data, due to error or fabrication.
Discussion
I analysed the proximity of means for baseline vari-
ables in 5087 randomised, controlled trials. In 15.6%
of trials, the probability of a more extreme distribution
was 1 in 10. Retracted trials had a higher proportion
of p values in the extreme 10% of the expected distri-
bution than trials that have not been retracted (43%
vs. 15%). There was evidence that trials retracted for
reasons other than data integrity may have contained
corrupt and possibly fabricated data. Trials with
extreme distributions of means were more likely to
contain incorrect or fabricated data than other trials,
as has been independently verified.
The discrepancy between the observed distribution
and the expected distribution of p values could be
because the expected distribution was wrong. I was
aware that stratified allocation could make group means
more similar, which is why I did not analyse the means
of stratified variables. However, the effect could have
‘carried over’ into non-stratified variables: for instance,
the mean weights of groups might have been made more
similar through stratification by sex or height. The
excess of dissimilar means would not be explained by
this mechanism, although correlations between variables
could account for excess trials with p values at either
extreme (near 0 or 1). Simulations could generate credi-
ble intervals for the contribution of stratification and
correlation to extreme p values. Investigators can
manipulate the distribution of participants into groups
if the allocation sequence is inadequately masked or if
the allocation sequence is predictable. The manipulation
of participant allocation could result in baseline mean
values that are similar or dissimilar, depending upon the
motivation and efficacy of the method used to distort
random allocation. The ‘observed’ distribution may have
been distorted by my mistakes. It is likely that I incor-
rectly transcribed some of the 72,261 means, 72,261
standard deviations, 72,261 participant numbers and
72,261 precisions for mean and SD.
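As a sketch of the simulations suggested above, the following base-R code (with MASS for correlated baseline variables) draws two-arm trials whose baseline variables share a common correlation, combines the per-variable p values as in the Methods, and counts how often the trial p value falls within 0.05 of 0 or 1; the group size, number of variables and correlation are arbitrary illustrative choices, not estimates from the surveyed trials.

```r
library(MASS)   # mvrnorm, for correlated baseline variables

set.seed(1)

# One-sided p value from a Welch t-test; uniform on (0, 1) under random allocation
one_sided_p <- function(x, y) {
  w <- t.test(x, y)
  unname(pt(w$statistic, w$parameter))
}

# One simulated two-arm trial with k baseline variables sharing correlation rho,
# combined across variables with Stouffer's sum of z
simulate_trial <- function(n = 30, k = 6, rho = 0) {
  Sigma <- matrix(rho, k, k)
  diag(Sigma) <- 1
  a <- mvrnorm(n, mu = rep(0, k), Sigma = Sigma)
  b <- mvrnorm(n, mu = rep(0, k), Sigma = Sigma)
  p <- sapply(seq_len(k), function(j) one_sided_p(a[, j], b[, j]))
  pnorm(sum(qnorm(p)) / sqrt(k))
}

extreme_rate <- function(rho, trials = 2000) {
  p <- replicate(trials, simulate_trial(rho = rho))
  mean(p < 0.05 | p > 0.95)        # proportion of trial p values within 0.05 of 0 or 1
}

extreme_rate(rho = 0)              # close to the expected 10%
extreme_rate(rho = 0.8)            # larger: correlated baseline variables inflate both tails
```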
Some trials with extreme p values probably con-
tained unintentional typographical errors, such as the
description of standard error as standard deviation and
vice versa. The more extreme the p value the more likely
Figure 5 The expected cumulative uniform distribution was not followed by the 26 trials retracted for fabrication, p = 1.9 × 10⁻⁵, composed of trials by Fujii and Reuben; and it was not followed by the 44 trials retracted for other reasons, p = 2.3 × 10⁻⁵, composed of trials by Boldt and others. The cumulative distributions of these two categories of retracted articles were different, p = 3.3 × 10⁻⁵.
there is to be an error (mine or the authors’), either
unintentional or fabrication. For instance, for 43/5015
unretracted and uncorrected trials, the probability that
random allocation would result in the distributions of
baseline means was less than 1 in 10¹⁵ (one water drop
in 20,000 Olympic-sized swimming pools). In a sample
of just over 5000 trials it seems reasonable to conclude
that these trials – and others with more likely distribu-
tions – almost certainly contain some sort of error. The
association of extreme distributions with trial retraction
suggests that further investigation of uncorrected unre-
tracted trials and their authors will result in most trials
being corrected and some retracted. The evidence for
this association in this survey comes mainly from spe-
cialist anaesthetic journals. It is unclear whether trials in
the anaesthetic journals have been more deserving of
retraction, or perhaps there is a deficit of retractions
from JAMA and NEJM.
In summary, the distribution of means for baseline
variables in randomised, controlled trials was inconsis-
tent with random sampling, due to an excess of very
similar means and an excess of very dissimilar means.
Fraud, unintentional error, correlation, stratified alloca-
tion and poor methodology might have contributed to
this distortion. The distortion in two non-specialist
medical journals was indistinguishable from that found
in six specialist anaesthetic journals. Future work
might determine whether this finding is general to all
randomised, controlled trials. Journal editors could use
Table 2 and online Appendix S1 to determine which
trials to correct and if necessary retract.
Acknowledgements
JC is an editor of Anaesthesia. No external funding or
other competing interests declared.
References
1. Carlisle JB. The analysis of 168 randomised controlled trials to
test data integrity. Anaesthesia 2012; 67: 521–37.
2. Carlisle JB, Dexter F, Pandit JJ, Shafer SL, Yentis SM. Calculating
the probability of random sampling for continuous variables in
submitted or published randomised controlled trials. Anaesthe-
sia 2015; 70: 848–58.
3. Pandit JJ. On statistical methods to test if sampling in trials is
genuinely random. Anaesthesia 2012; 67: 456–62.
4. Carlisle JB, Loadsman JA. Evidence for non-random sampling in
randomised, controlled trials by Yuhji Saitoh. Anaesthesia
2017; 72: 17–27.
5. Bolland MJ, Avenell A, Gamble GD, Grey A. Systematic review
and statistical analysis of the integrity of 33 randomized con-
trolled trials. Neurology 2016; 87: 2391–402.
6. Retraction Watch. The Retraction Watch Leaderboard. http://retractionwatch.com/the-retraction-watch-leaderboard/ (accessed 17/01/2017).
Supporting Information
Additional Supporting Information may be found in
the online version of this article:
Appendix S1. Each row in this spreadsheet is one
published trial, identified by journal (sheet), with col-
umns for year, volume and first page. Consecutive col-
umns list one-sided and two-sided trial p values, the
latter of which can be sorted in order of probability.
The next six columns list the results of six different
methods for combining the p values of separate base-
line variables. The next column sums the number of
variables analysed for each trial, followed by the one-
sided p values for each variable.
Appendix S2. This appendix lists the values that
were analysed: participant number; mean; standard
deviation; decimal places for mean; decimal places for
standard deviation. Values for trials published in dif-
ferent journals are on separate sheets, with the trial
number corresponding to that listed in Appendix S1.