Evaluation of the Medical Literature

By admin On Oct 21, 2024

Ovid: Lovell & Winter’s Pediatric Orthopaedics

Editors: Morrissy, Raymond T.; Weinstein, Stuart L.

Title: Lovell & Winter’s Pediatric Orthopaedics, 6th Edition

Mininder S. Kocher

Evaluation of the medical literature is an essential
task of the pediatric orthopaedic surgeon in order to determine the
efficacy of treatments, stay abreast of new technology, and provide
optimal patient care. However, this task can be daunting because the
clinician is inundated with medical information from scientific
journals, scientific meetings, the lay press, industry, and even the
Internet. Critical evaluation of the medical literature is vital in
order to distinguish the studies that are scientifically sound and
sufficiently compelling to warrant a change of practice from those that
are methodologically flawed or biased. A working understanding of
clinical epidemiology and biostatistics is necessary for critical
evaluation of the medical literature. This chapter provides an overview
of the concepts of study design, hypothesis testing, measures of
effect, diagnostic performance, evidence-based medicine (EBM), outcomes
assessment, and biostatistics. Examples from the orthopaedic literature
and a glossary of terminology (terms italicized throughout the text)
are provided.

STUDY DESIGN

Clinical research study design has evolved from
cataloguing vital statistics from birth and death records in the 1600s,
through correlational studies associating cholera with water
contamination, case-control studies linking smoking with lung cancer,
and prospective cohort studies such as the Framingham Heart Study, to
the randomized clinical trial (RCT) for the polio vaccine (1). The evidence-based medicine (EBM) and patient-derived outcomes
assessment movements burst onto the scene of clinical medicine in the
1980s and 1990s as a result of contemporaneous medical, societal, and
economic influences. Pioneers such as Sackett and Feinstein emphasized
levels of evidence and patient-centered outcomes assessment (2,3,4,5,6,7,8,9,10).
Work by Wennberg et al. revealed substantial small-area variations in
clinical practice, with some patients being 30 times more likely to
undergo an operative procedure than other patients with identical
symptoms just because of their geographic location (11,12,13,14,15,16).
Further critical research suggested that up to 40% of some surgical
procedures might be inappropriate and that up to 85% of common medical
treatments had not been rigorously validated (17,18,19).
Meanwhile, the costs of health care in the US were rapidly rising to
over 2 billion dollars per day, increasing from 5.2% of the gross
domestic product in 1960 to 16.2% in 1997 (20).
Health maintenance organizations and managed care emerged in recent
times. In addition, increasing federal, state, and consumer oversight
was brought to bear on the practice of clinical medicine. These forces
have led to an increased focus on the effectiveness of clinical care and on the design of clinical research studies.

In observational studies, researchers observe patient groups without allocation of the intervention, whereas in experimental studies, researchers allocate the treatment. Experimental studies involving humans are called trials. Research studies may be retrospective,
which means that the direction of inquiry is backward from the cases
and that the events of interest have transpired before the onset of the
study, or they may be prospective, which
means that the direction of inquiry is forward from the cohort
inception and that the events of interest transpire after the onset of
the study (Fig. 4.1). Cross-sectional studies are used for surveying patients at one point in time. Longitudinal studies follow the same patients over multiple points in time.

All research studies are susceptible to invalid conclusions because of bias, confounding, and chance. Bias
is the nonrandom systematic error in the design or conduct of a study.
Bias is usually not intentional; however, it is pervasive and
insidious. Bias can corrupt a study at any phase, including patient
selection (i.e., selection and membership bias), study performance
(i.e., performance and information bias), patient follow-up (i.e.,
nonresponder and transfer bias), and outcome determination (i.e.,
detection, recall, acceptability, and interviewer bias). Frequent
biases in the orthopaedic literature include selection bias when unlike
groups are being compared, nonresponder bias in studies with low
follow-up rates, and interviewer bias when the investigator is
determining outcome. A confounder is a variable having independent associations with both the independent (predictor) and the dependent
(outcome) variables, thereby potentially distorting the relation
between them. For example, an association between knee laxity and
anterior cruciate ligament injury may be confounded by female gender
because women may have greater knee laxity and a higher risk of
anterior cruciate ligament injury. Frequent confounders in clinical
research include gender, age, socioeconomic status, and comorbidities.
As discussed in the section on hypothesis testing, chance may lead to invalid conclusions based on the probability of type-I and type-II errors, which are related to P values and power.

Figure 4.1 Prospective versus retrospective study design, defined on the basis of the direction of inquiry and on the onset of the study.

The adverse effects of bias, confounding, and chance can
be minimized by study design and statistical analysis. Prospective
studies minimize the bias associated with patient selection, quality of
information, attempts to recall preoperative status, and nonresponders.
Randomization minimizes selection bias and equally distributes confounders. Blinding can further decrease bias, and matching can decrease confounding. Confounders can sometimes be controlled post hoc
by the use of stratified analysis or multivariable methods. The effects
of chance can be minimized by an adequate sample size on the basis of power
calculations and use of appropriate levels of significance in
hypothesis testing. The ability of study design to optimize validity
while minimizing bias, confounding, and chance is acknowledged by the
adoption of hierarchic levels of evidence based on study design (Table 4.1).

Observational study designs include case series, case-control studies, cross-sectional surveys, and cohort studies. A case series
is a retrospective, descriptive account of a group of patients with
interesting characteristics or of a series of patients who have
undergone an intervention. A case series of one patient is a case report.
Case series are easy to construct and can provide a forum for the
presentation of interesting or unusual observations. However, case
series are often anecdotal, are subject to many possible biases, lack a
hypothesis, and are difficult to compare with other series. Therefore,
case series are usually viewed as a means of generating hypotheses for
further studies but are not viewed as conclusive.

A case-control study is one
in which the investigator identifies patients with an outcome of
interest (cases) and patients without the outcome (controls) and then
compares the two groups in terms of possible risk factors. The effects
in a case-control study are frequently reported in terms of the odds ratio.
Case-control studies are efficient (particularly for the evaluation of
unusual conditions or outcomes) and are relatively easy to perform.
However, an appropriate control group may be difficult to identify, and
preexisting high-quality medical records are essential. Moreover,
case-control studies are susceptible to multiple biases, particularly
selection and detection bias based on the identification of cases and
controls.

Cross-sectional surveys are
often used for determining the prevalence of disease or for identifying
coexisting associations in patients who have a particular condition at
one particular point in time. Prevalence
of a condition is the number of individuals with the condition divided
by the total number of individuals at a particular point in time. Incidence,
in contradistinction, refers to the number of individuals who newly develop
the condition divided by the total number of individuals at risk over a defined
time period. Thus, prevalence data are usually obtained from a
cross-sectional survey and are expressed as a proportion, whereas
incidence data are usually
obtained from a prospective cohort study and contain a time value in
the denominator. Surveys are also frequently performed to determine
preferences and treatment patterns. Because cross-sectional studies
represent a snapshot in time, they may be misleading if the research
question involves the disease process over time. Surveys also present
unique challenges in terms of adequate response rate, representative
samples, and acceptability bias.
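
The distinction between prevalence and incidence can be made concrete with a short sketch. The counts below are hypothetical, and the use of person-years as the time-based denominator for incidence is an assumption chosen for illustration rather than a method prescribed in this chapter.

```python
def prevalence(existing_cases, population):
    """Proportion of individuals who have the condition at one point in time."""
    return existing_cases / population


def incidence_rate(new_cases, person_years_at_risk):
    """New cases per person-year of follow-up (time appears in the denominator)."""
    return new_cases / person_years_at_risk


# Hypothetical cross-sectional survey: 50 of 1,000 children have the condition.
point_prevalence = prevalence(50, 1000)

# Hypothetical cohort: 12 new cases observed over 2,400 person-years at risk.
rate = incidence_rate(12, 2400)

print(f"Prevalence: {point_prevalence:.3f}")
print(f"Incidence: {rate:.4f} per person-year")
```

Prevalence is a dimensionless proportion, whereas the incidence figure is only interpretable together with its time unit.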

TABLE 4.1 LEVELS OF EVIDENCE FOR PRIMARY RESEARCH QUESTION USED BY THE JOURNAL OF BONE AND JOINT SURGERY

Types of Studies
	Therapeutic Studies— Investigating the Results of Treatment	Prognostic Studies— Investigating the Outcome of Disease	Diagnostic Studies— Investigating a Diagnostic Test	Economic and Decision Analyses— Developing an Economic or Decision Model
Level I	Randomized controlled trial (significant difference, or no significant difference but narrow confidence intervals); systematic review^b of Level-I randomized controlled trials (studies were homogeneous)	Prospective study^a; systematic review^b of Level-I studies	Testing of previously developed diagnostic criteria in series of consecutive patients (with universally applied reference “gold” standard); systematic review^b of Level-I studies	Clinically sensible costs and alternatives; values obtained from many studies; multiway sensitivity analyses; systematic review^b of Level-I studies
Level II	Prospective cohort study^c; poor-quality randomized controlled trial (e.g., < 80% follow-up); systematic review^b of Level-II studies or of nonhomogeneous Level-I studies	Retrospective study^d; study of untreated controls from a previous randomized controlled trial; systematic review^b of Level-II studies	Development of diagnostic criteria based on consecutive patients (with universally applied reference “gold” standard); systematic review^b of Level-II studies	Clinically sensible costs and alternatives; values obtained from limited studies; multiway sensitivity analyses; systematic review^b of Level-II studies
Level III	Case-control study^e; retrospective cohort study^d; systematic review^b of Level-III studies	—	Study of nonconsecutive patients (no consistently applied reference “gold” standard); systematic review^b of Level-III studies	Limited alternatives and costs; poor estimates; systematic review^b of Level-III studies
Level IV	Case series (no, or historic, control group)	Case series	Case-control study Poor reference standard	No sensitivity analyses
Level V	Expert opinion	Expert opinion	Expert opinion	Expert opinion
^aAll patients were enrolled at the same point in their disease course (inception cohort) with greater than or equal to 80% follow-up of enrolled patients.
^bA study of results from two or more previous studies.
^cPatients were compared with a control group of patients treated at the same time and institution.
^dThe study was initiated after treatment was performed.
^ePatients with a particular outcome (“cases” with, e.g., a failed total arthroplasty) were compared with those who did not have the outcome (“controls” with, e.g., a total hip arthroplasty that did not fail).

A traditional cohort study
is one in which a population of interest is identified and followed
prospectively in order to determine outcomes and associations with risk
factors. Retrospective cohort studies, or historic cohort studies, can
also be performed, in which cohort members are identified on the basis
of records, and the follow-up period occurs entirely or partly in the
past. Cohort studies are optimal for studying the incidence, course,
and risk factors of a disease because they are longitudinal, which
means that a group of subjects is followed over time. The effects in a
cohort study are frequently reported in terms of relative risk.
Because traditional cohort studies are prospective, they can optimize
follow-up and data quality and can minimize the bias associated with
selection, information, and measurement. In addition, they have the
correct time sequence to provide strong evidence about associations.
However, these studies are costly, are logistically demanding, often
require long periods for completion, and are inefficient in assessing
unusual outcomes or diseases.

Experimental study designs may involve the use of concurrent controls, sequential controls (crossover trials), or historic controls. The randomized clinical trial (RCT) with concurrent controls is the so-called gold standard
of clinical evidence because it provides the most valid conclusions
(internal validity) by minimizing the effects of bias and confounding.
A rigorous randomization with enough patients is the best means of
avoiding confounding. The setting up of an RCT involves the
construction of a protocol document that explicitly establishes
eligibility criteria, sample size, informed consent, randomization,
stopping rules, blinding, measurement, monitoring of compliance,
assessment of safety, and data analysis. Because allocation is random,
the selection bias is minimized, and confounders (known and unknown)
are equally distributed, in theory, between groups. Blinding minimizes
performance, detection, interviewer, and acceptability bias. Blinding
may be practiced at four levels: (a) participants, (b) investigators
applying the intervention, (c) outcome assessors, and (d) analysts. Intention-to-treat analysis
minimizes nonresponder and transfer bias, whereas sample-size
determination ensures adequate power. The intention-to-treat principle
states that all patients should be analyzed within the treatment group
to which they were randomized in order to preserve the goals of
randomization. Although the RCT is the epitome of clinical research
designs, the disadvantages of RCTs include their high expense, logistic
complexities, and length of time to completion. Accrual of patients and
acceptance by clinicians may be difficult to achieve. With rapidly
evolving technology, a new technique may quickly become well accepted,
making an existing RCT obsolete or a potential RCT difficult to accept.
In terms of ethics, RCTs require clinical equipoise (i.e., equality of
treatment options in the clinician’s judgment) for enrollment, interim
stopping rules to avoid harm and evaluate adverse events, and truly
informed consent. Finally, although RCTs have excellent internal
validity, some have questioned their generalizability (external
validity) because the practice pattern and the population of patients
enrolled in an RCT may be overly constrained and nonrepresentative.

Ethical considerations are intrinsic to the design and
conduct of clinical research studies. Informed consent is of paramount
importance and it is the focus of much of the activity of Institutional
Review Boards. Investigators should be familiar with the Nuremberg Code
and the Declaration of Helsinki because they pertain to ethical issues
of risks and benefits, protection of privacy, and respect for autonomy (21,22).

HYPOTHESIS TESTING

The purpose of hypothesis testing is to permit
generalizations from a sample to the population from which it came.
Hypothesis testing either confirms or refutes the assertion that the
observed findings did not occur by chance alone, but because of a true
association between variables. By default, the null hypothesis of a study asserts that there is no significant association between variables, whereas the alternative hypothesis
asserts that there is a significant association. If the findings of a
study are not significant we cannot reject the null hypothesis, whereas
if the findings are significant we can reject the null hypothesis and
accept the alternative hypothesis.

Therefore, all research studies that are based on a
sample make an inference about the truth in the overall population. By
constructing a table of the possible outcomes of a study (Table 4.2), we can see that the inference of a study is correct if no significant association is
found when there is no true association or if a significant association
is found when there is a true association. However, a study can have
two types of errors: A type-I or alpha (α) error
occurs when a significant association is found when there is no true
association (resulting in a “false-positive” study that rejects a true
null hypothesis); a type-II or beta (β) error
occurs when there is a true association, but the study wrongly
concludes that there is none (resulting in a “false-negative” study
that rejects a true alternative hypothesis).

TABLE 4.2 HYPOTHESIS TESTING

	Truth
Experiment	No Association	Association
No Association	Correct	Type-II (β) error
Association	Type-I (α) error	Correct
P value: probability of type-I (α) error
Power: 1 – probability of type-II (β) error

The α level refers to the probability of the type-I (α)
error. By convention, the α level of significance is set at 0.05, which
means that we accept the finding of a significant association if there
is less than a one in 20 possibility that the observed association was
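due to chance alone. The P value, which is
calculated from a statistical test, is the probability of obtaining a result at least as
extreme as the one observed if the null hypothesis were true; it is therefore a measure of the
extent to which the data are compatible with the null hypothesis. If the P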
value is less than the α level (0.05), then the evidence against the
null hypothesis is strong enough to reject it, and the conclusion will
be that the result is statistically significant. The P
values are frequently used in clinical research and are given great
importance by journals and readers. However, there is a strong movement
in biostatistics to deemphasize P values
because: (a) a significance level of P <0.05 is arbitrary; (b) a
strict cutoff point can be misleading (there is little difference
between P = 0.049 and P = 0.051 yet only the former is considered
“significant”); (c) the P value gives no information about the strength of the association, and (d) the P
value may be statistically significant without the results being
clinically important. Alternatives to the traditional reliance on P
values include the use of variable α levels of significance based on
the consequences of the type-I error, and the reporting of P values without using the term “significant.” The use of 95% confidence intervals instead of P
values has gained acceptance, because these intervals convey
information about the significance of the findings (i.e., 95%
confidence intervals do not overlap if they are significantly
different), the magnitude of the differences, and the precision of
measurement (indicated by the range of the 95% confidence interval).
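
As an illustration of how a 95% confidence interval conveys significance, magnitude, and precision at once, the following sketch computes an interval for the difference between two proportions. The event counts are hypothetical, and the simple normal-approximation (Wald) interval is an assumption made for brevity; other interval methods may be preferable with small samples.

```python
import math


def difference_in_proportions_ci(events_1, n_1, events_2, n_2, z=1.96):
    """Wald 95% confidence interval for p1 - p2 (normal approximation)."""
    p1 = events_1 / n_1
    p2 = events_2 / n_2
    difference = p1 - p2
    standard_error = math.sqrt(p1 * (1 - p1) / n_1 + p2 * (1 - p2) / n_2)
    return difference, difference - z * standard_error, difference + z * standard_error


# Hypothetical data: 30 of 100 events in one group versus 15 of 100 in the other.
difference, lower, upper = difference_in_proportions_ci(30, 100, 15, 100)

print(f"Difference: {difference:.3f} (95% CI {lower:.3f} to {upper:.3f})")
# An interval that excludes 0 corresponds to P < 0.05 for the difference.
print("Excludes zero:", lower > 0 or upper < 0)
```

The width of the interval reflects the precision of the estimate, and its position relative to zero reflects significance at the 0.05 level.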

Power is the probability of
finding a significant association if it truly exists and is defined as
the difference between 1 and the probability of a type-II (β) error. By
convention, the acceptable power is set at 80% or more, which means
there is a 20% or less chance that the study will demonstrate no
significant association when there is an association. In practice, when
a study demonstrates a significant association, the potential error of
concern is the type-I (α) error, as expressed by the P
value. However, when a study demonstrates no significant association,
the potential error of concern is the type-II (β) error, as expressed
by power. That is, in a study that demonstrates no significant effect,
there may truly be no significant effect; on the other hand, there may
actually be a significant effect but the study was too underpowered to
demonstrate it because the sample size may have been too small or the
measurements may have been too imprecise. Therefore, in a study that
demonstrates no significant effect, the power of the study should be
reported. The calculations for power analyses differ depending on the
statistical methods used in the analysis. Four elements are involved in
a power analysis: α, β, effect size, and sample size (n). Effect size
is the difference that one wants to be able to detect with the given α
and β. It is based on a clinical sense about what extent of difference
would be clinically meaningful. Effect sizes are often defined as
dimensionless terms on the basis of a difference in mean values divided
by the pooled standard deviation for a comparison of two groups. Low
sample sizes, small effect sizes, and large variance decrease the power
of a study. An understanding of power issues is important in clinical
research in order to use resources efficiently when planning a study and to
ensure the validity of the study. Sample size
calculations are performed when planning a study. Typically, power is
set at 80%, α level is set at 0.05, the effect size and variance are
estimated from pilot data or the literature, and the equation is solved
for the necessary sample size. The calculation of power after the study
has been completed (i.e., post hoc power analysis) is controversial and is not recommended.
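
The relation among α, power, effect size, and sample size can be sketched for the simplest case, a comparison of the means of two equal-sized groups. The formula used here is the standard normal approximation, which is an assumption of this example; exact calculations based on the t distribution give slightly larger sample sizes. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Patients needed per group (normal approximation, two-sided alpha).

    effect_size is the difference in means divided by the pooled
    standard deviation, as described in the text.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n)


def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of the same two-group comparison for a given n."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * math.sqrt(n_per_group / 2) - z_alpha)


for d in (0.2, 0.5, 0.8):
    n = sample_size_per_group(d)
    print(f"effect size {d}: {n} per group, power {approximate_power(d, n):.3f}")
```

The loop shows how quickly the required sample size grows as the effect size to be detected shrinks.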

DIAGNOSTIC PERFORMANCE

A diagnostic test can result in four possible scenarios: (a) true positive if the test is positive and if the disease is present, (b) false positive if the test is positive and if the disease is absent, (c) true negative if the test is negative and if the disease is absent, and (d) false negative if the test is negative and if the disease is present (Table 4.3). The sensitivity of a
test is the percentage (or proportion) of patients who have the disease
and are classified positive (true-positive rate). A test with 97%
sensitivity implies that of 100 patients with disease, 97 will show
positive test results. Sensitive tests have a low false-negative rate.
A negative result on a highly sensitive test rules disease out (SNout).
The specificity
of a test is the percentage (or proportion) of patients without the
disease who are classified negative (true-negative rate). A test with
91% specificity implies that of 100 patients without the disease, 91
will show negative test results. Specific tests have a low
false-positive rate. A positive result on a highly specific test rules
disease in (SPin). Sensitivity and specificity can be combined into a
single parameter, the likelihood ratio (LR),
which is the probability of a true positive divided by the probability
of a false positive. Sensitivity and specificity can be established in
studies in which the results of a diagnostic test are compared with the
gold standard of diagnosis in the same patients [e.g., by comparing the
results of magnetic resonance imaging with arthroscopic findings (23)].

TABLE 4.3 DIAGNOSTIC TEST PERFORMANCE

	Disease Positive	Disease Negative
Test Positive	a (true positive)	b (false positive)
Test Negative	c (false negative)	d (true negative)
Sensitivity:	a/(a + c)
Specificity:	d/(b + d)
Accuracy:	(a + d)/(a + b + c + d)
False-negative rate:	1 – sensitivity
False-positive rate:	1 – specificity
Likelihood ratio (+)	sensitivity/false-positive rate
Likelihood ratio (-)	false-negative rate/specificity
Positive predictive value:	[(prevalence)(sensitivity)]/[(prevalence)(sensitivity) + (1 – prevalence)(1 – specificity)]
Negative predictive value:	[(1 – prevalence)(specificity)]/[(1 – prevalence)(specificity) + (prevalence)(1 – sensitivity)]

Sensitivity and specificity are technical parameters of
diagnostic testing performance and have important implications for
screening and clinical practice guidelines (CPGs) (24,25);
however, they are less relevant in the typical clinical setting because
the clinician does not know whether the patient has the disease. The
clinically relevant questions are the probability that a patient has
the disease given a positive result [positive predictive value (PPV)] and the probability that a patient does not have the disease given a negative result [negative predictive value (NPV)].
The PPVs and NPVs are probabilities that require an estimate of the
prevalence of the disease in the population and can be calculated using
equations that utilize Bayes theorem (26).
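
The dependence of predictive values on prevalence can be illustrated with a short sketch that applies Bayes theorem to the sensitivity (97%) and specificity (91%) used as examples above. The prevalence values are hypothetical, chosen only to show the trend, and the function names are illustrative.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (positive LR, negative LR)."""
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence, using Bayes theorem."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    true_negative = (1 - prevalence) * specificity
    false_negative = prevalence * (1 - sensitivity)
    ppv = true_positive / (true_positive + false_positive)
    npv = true_negative / (true_negative + false_negative)
    return ppv, npv


sensitivity, specificity = 0.97, 0.91
lr_positive, lr_negative = likelihood_ratios(sensitivity, specificity)
print(f"LR+ = {lr_positive:.2f}, LR- = {lr_negative:.3f}")

# Hypothetical prevalences: the same test performs very differently.
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(sensitivity, specificity, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

With a rare disease, most positive results are false positives even for a test with high sensitivity and specificity, which is why prevalence must be estimated before a test result is interpreted.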

There is an inherent trade-off between sensitivity and
specificity. Because there is typically some overlap between the
diseased and nondiseased groups with respect to a test distribution,
the investigator can select a positivity criterion with a low
false-negative rate (to optimize sensitivity) or a criterion with a low
false-positive rate (to optimize specificity) (Fig. 4.2).
In practice, positivity criteria are selected on the basis of the
consequences of a false-positive or a false-negative diagnosis. If the
consequences of a false-negative diagnosis outweigh the consequences of
a false-positive diagnosis of a condition [e.g., septic arthritis of
the hip in children (27)], a more sensitive
criterion is chosen. This relation between the sensitivity and
specificity of a diagnostic test can be portrayed by using a receiver-operating characteristic (ROC) curve. An ROC graph shows the relation between the true-positive rate (sensitivity) on the y axis and the false-positive rate (1–specificity) on the x axis plotted at each possible cutoff (Fig. 4.3). Overall diagnostic performance can be evaluated from the area under the ROC curve (28).
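
The construction of an ROC curve can be sketched as follows: each observed test value is used in turn as the positivity criterion, the resulting false-positive rate and sensitivity are recorded, and the area under the curve is obtained with the trapezoidal rule. The scores and disease labels are hypothetical, and the trapezoidal rule is one common choice among several for estimating the area.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, sensitivity) pairs, one per cutoff.

    labels: 1 = disease present, 0 = disease absent.
    A test is called positive when its score is >= the cutoff.
    """
    n_diseased = sum(labels)
    n_nondiseased = len(labels) - n_diseased
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((false_pos / n_nondiseased, true_pos / n_diseased))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical test values for four diseased and four nondiseased patients.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(f"Area under the ROC curve: {auc:.3f}")
```

Lowering the cutoff moves along the curve toward higher sensitivity at the cost of a higher false-positive rate, which is the trade-off described above.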

Figure 4.2
Selecting a positivity criterion. Because there typically is overlap
between the diseased population and the nondiseased population over a
range of diagnostic values (x axis), there
is an intrinsic trade-off between sensitivity and specificity.
Identifying positive test results to the right of cutoff point A, there
is high sensitivity because most diseased patients are correctly
identified as positive; however, there is lower specificity because
some nondiseased patients are incorrectly identified as positive (false
positives). Identifying positive test results to the right of cutoff
point B, there is lower sensitivity because some diseased patients are
incorrectly identified as negative (false negatives); however, there is
high specificity because most nondiseased patients are correctly
identified as negative.

MEASURES OF EFFECT

Measures of likelihood include probability and odds. Probability
is a number, between 0 and 1, that indicates how likely an event is to
occur on the basis of the number of events per number of trials. The
probability of heads on a coin toss is 0.5. Odds
refers to the ratio of the probability of an event occurring to the
probability of the event not occurring. The odds of flipping a heads on
a coin toss is 1 (0.5/0.5). Because probability and odds are related,
they can be converted; odds = probability/(1 – probability).
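
The conversion between probability and odds can be written out in both directions. This is a minimal sketch; the function names are illustrative.

```python
def odds_from_probability(probability):
    """odds = probability / (1 - probability)"""
    return probability / (1 - probability)


def probability_from_odds(odds):
    """probability = odds / (1 + odds)"""
    return odds / (1 + odds)


# Coin toss: a probability of 0.5 corresponds to odds of 1 (1:1).
print(odds_from_probability(0.5))

# A probability of 0.75 corresponds to odds of 3 (3:1), and back again.
print(odds_from_probability(0.75))
print(probability_from_odds(3.0))
```

For rare events the two measures are nearly equal, but they diverge as the probability increases.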

Relative risk (RR)
can be determined in a prospective cohort study, where RR equals the
incidence of disease in the exposed cohort divided by the incidence of
disease in the nonexposed cohort (Table 4.4). A similar measurement in a retrospective case-control study (where incidence cannot be determined) is the odds ratio (OR),
which is the ratio of the odds of a patient in the study group having
the disease compared with the odds of a patient in the control group
having the same disease (Table 4.4). For
example, a prospective cohort study of
anterior-cruciate-ligament–deficient skiers that finds a significantly
higher proportion of subsequent knee injuries in nonbraced (12.7%)
versus braced (2.0%) skiers may report a risk ratio of 6.4 (12.7%/2.0%)
(29). This report can be interpreted to mean
that a nonbraced anterior-cruciate-ligament–deficient skier has a 6.4
times higher risk of subsequent knee injury than a braced skier.

Factors that are likely to increase the incidence, prevalence, morbidity, or mortality of a disease are called risk factors.

The effect of a factor that reduces the probability of an adverse outcome can be quantified by the relative risk reduction (RRR), the absolute risk reduction (ARR), and the number needed to treat (NNT) (Table 4.4). The effect of a factor that increases the probability of an adverse outcome can be quantified by the relative risk increase (RRI), the absolute risk increase (ARI), and the number needed to harm (NNH) (Table 4.4).
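
These treatment-effect measures can be computed directly from a 2 × 2 table laid out as in Table 4.4 (a, b = experimental group with and without adverse events; c, d = control group with and without adverse events). The sketch below uses hypothetical counts, takes the reductions as the control rate minus the experimental rate so that a beneficial treatment gives positive values, and rounds the NNT up to a whole patient, which is a common convention rather than a rule stated in this chapter.

```python
import math


def treatment_effects(a, b, c, d):
    """Effect measures for a treatment that lowers the adverse event rate.

    Assumes the experimental event rate is below the control event rate,
    so that the absolute risk reduction is positive.
    """
    eer = a / (a + b)          # experimental event rate
    cer = c / (c + d)          # control event rate
    arr = cer - eer            # absolute risk reduction
    return {
        "EER": eer,
        "CER": cer,
        "RR": eer / cer,
        "OR": (a / b) / (c / d),
        "RRR": arr / cer,
        "ARR": arr,
        "NNT": math.ceil(1 / arr),
    }


# Hypothetical trial: 10 of 100 treated and 25 of 100 control patients
# have an adverse event.
effects = treatment_effects(10, 90, 25, 75)
for name, value in effects.items():
    print(f"{name}: {value:.3f}" if name != "NNT" else f"{name}: {value}")
```

In this hypothetical example the relative risk reduction (60%) appears more impressive than the absolute risk reduction (15%), which is why both are usually reported together with the NNT.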

Figure 4.3
Receiver-operating characteristic (ROC) curve for a clinical prediction
rule to differentiate septic arthritis from transient synovitis of the
hip in children (27). The false-positive rate
(1 – specificity) is plotted on the x axis, and sensitivity is plotted
on the y axis. The area under the curve represents the overall
diagnostic performance of a prediction rule or a diagnostic test. For a
perfect test, the area under the curve is 1.0. For random guessing, the
area under the curve is 0.5. (From Kocher MS, Zurakowski D, Kasser JR.
Differentiating between septic arthritis and transient synovitis of the
hip in children: an evidence-based clinical prediction algorithm. J Bone Joint Surg 1999;81A:1662–1670, with permission.)

OUTCOMES ASSESSMENT

Process refers to the medical care that a patient receives, whereas outcome
refers to the result of that medical care. The emphasis of the outcomes
assessment movement has been on patient-derived outcomes assessment.
Measures of outcomes include generic measures, condition-specific measures, and measures of patient
satisfaction (30).

TABLE 4.4 TREATMENT EFFECTS

	Adverse Events	No Adverse Events
Experimental Group	a	b
Control Group	c	d
Control Event Rate (CER):	c/(c + d)
Experimental Event Rate (EER):	a/(a + b)
Control Event Odds (CEO):	c/d
Experimental Event Odds (EEO):	a/b
Relative Risk (RR):	EER/CER
Odds Ratio (OR):	EEO/CEO
Relative Risk Reduction (RRR):	(CER – EER)/CER
Absolute Risk Reduction (ARR):	CER – EER
Number Needed to Treat (NNT):	1/ARR
Relative Risk Increase (RRI):	(EER – CER)/CER
Absolute Risk Increase (ARI):	EER – CER
Number Needed to Harm (NNH):	1/ARI

Generic measures, such as
the Short Form-36 (SF-36), are used for assessing health status or
health-related quality of life, on the basis of the World Health
Organization’s multiple-domain definition of health (31,32).

Condition-specific measures,
such as the International Knee Documentation Committee (IKDC) knee
score or the Constant shoulder score, are used for assessing aspects of
a specific condition or body system.

Measures of patient satisfaction
are used for assessing various components of care and have diverse
applications, including quality of care, health care delivery,
patient-centered models of care, and continuous quality improvement (33,34,35,36).

The process of developing an outcomes instrument
involves identifying the construct, devising items, scaling responses,
selecting items, forming factors, and creating scales. A large number
of outcomes instruments have been developed and used without formal
psychometric assessment of their reliability, validity, and
responsiveness to change.

Reliability refers to the repeatability of an instrument. Interobserver and intraobserver reliability
refer to the repeatability of the instrument when used by different
observers or by the same observer at different time-points,
respectively. Test–retest reliability can
be assessed by using the instrument to evaluate the same patient on two
different occasions without an interval change in medical status. These
results are usually reported using the kappa statistic or intraclass correlation coefficient.
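
As an illustration of the kappa statistic, the following sketch computes Cohen's kappa for two observers who classify the same set of patients. The ratings are hypothetical, and the unweighted two-observer form is assumed; weighted kappa or the intraclass correlation coefficient would be used for ordinal or continuous measurements.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two observers, corrected for chance agreement."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    # Agreement expected by chance, from each observer's category frequencies.
    expected = sum(counts_a[category] * counts_b[category]
                   for category in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical classifications of ten radiographs by two observers.
observer_1 = ["abnormal"] * 5 + ["normal"] * 5
observer_2 = ["abnormal"] * 4 + ["normal"] * 5 + ["abnormal"]

kappa = cohens_kappa(observer_1, observer_2)
print(f"Kappa: {kappa:.2f}")
```

Here the observers agree on 8 of 10 radiographs (80%), but because 50% agreement would be expected by chance alone, kappa is 0.6.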

Validity refers to whether the instrument measures what it purports to measure. Content validity
assesses whether an instrument is representative of the characteristic
being measured, using expert consensus opinion (face validity). Criterion validity assesses an instrument’s relationship to an accepted, “gold-standard” instrument. Construct validity
assesses whether an instrument follows accepted hypotheses (constructs)
and produces results consistent with theoretical expectations.

Responsiveness to change assesses how an instrument’s values change over the course of the disease and its treatment.

EVIDENCE-BASED MEDICINE

EBM involves the
conscientious, explicit, and judicious use of current best evidence in
making decisions about the care of individual patients (37).
EBM integrates best research evidence with clinical expertise and
patient values. The steps of EBM involve: (a) converting the need for
information into an answerable question; (b) tracking down the best
evidence to answer that question; (c) critically appraising the
evidence with regard to its validity, impact, and applicability; and
(d) integrating the critical appraisal with clinical expertise and the
patient’s unique values and circumstances (38,39).
The types of questions asked in EBM are foreground questions pertaining
to specific knowledge about managing patients who have a particular
disorder. Evidence is graded on the basis of study design (Table 4.1),
with an emphasis on RCTs, and can be found on evidence-based databases
[e.g., Evidence-Based Medicine Reviews (EBMR) from Ovid Technologies,
the Cochrane Database of Systematic Reviews, Best Evidence, Clinical
Evidence, National Guidelines Clearinghouse, CancerNet, and Medline]
and evidence-based journals (e.g., Evidence-Based Medicine and ACP Journal Club).

Figure 4.4 Expected-value decision analysis tree for operative versus nonoperative management of acute Achilles tendon rupture (45).
Decision nodes are represented by □, chance nodes are represented by ^,
and terminal nodes are represented by ▲. Mean
outcome utility scores are listed to the right of the terminal node
(0–10). Outcome probabilities are listed under the terminal node title
(0–1). Operative treatment is favored because it has a higher expected
value (6.52 versus 6.28). (From Kocher MS, Bishop J, Luke A, et al.
Operative vs nonoperative management of acute achilles tendon ruptures:
expected-value decision analysis. Am J Sports Med 2002;30: 783–790, with permission.)

A systematic review (SR)
is a summary of the medical literature in which explicit methods are
used to perform a thorough literature search and a critical appraisal
of studies. A more specialized type of SR is a meta-analysis,
in which quantitative methods are used to combine the results of
several independent studies (usually RCTs) to produce statistical
summaries. For example, a study that systematically reviews the
literature (with inclusion and exclusion criteria for studies) about
internal fixation versus arthroplasty for femoral neck fractures and
then summarizes the subsequent outcomes and complications would be
considered a SR. On the other hand, a study that systematically reviews
the literature (with inclusion and exclusion criteria for studies) and
then combines the patients to perform new statistical analyses would be
considered a meta-analysis (40).
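One common way of combining results quantitatively is fixed-effect, inverse-variance pooling, in which each study is weighted by the precision of its estimate. The text does not state which pooling method a given meta-analysis uses, so the Python sketch below is only an illustration of the general idea; the function name and the study values are hypothetical.

```python
import math


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance (fixed-effect) pooled estimate, its SE, and a 95% CI."""
    if len(estimates) != len(standard_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    # Each study is weighted by 1 / variance, so precise studies count more.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci


# Hypothetical log relative risks and standard errors from three trials.
log_rr = [-0.40, -0.10, -0.25]
se = [0.20, 0.10, 0.20]
pooled, pooled_se, (low, high) = pool_fixed_effect(log_rr, se)
print(f"pooled RR = {math.exp(pooled):.2f} "
      f"(95% CI {math.exp(low):.2f} to {math.exp(high):.2f})")
```

Ratio measures are pooled on the logarithmic scale and exponentiated afterward, which is why the example works with log relative risks.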

Clinical pathways or clinical practice guidelines (CPG)
are algorithms that are developed, on the basis of the best available
evidence, to standardize processes and optimize outcomes. They may also
potentially reduce errors of omission and commission, reduce variations
in practice patterns, and decrease costs (41).

Decision analysis is a methodologic tool that allows for the quantitative analysis of decision making under conditions of uncertainty (42,43,44).
The rationale underlying explicit decision analysis is that a decision
must be made, often under circumstances of uncertainty, and that
rational decision theory optimizes expected value. The process of
expected-value decision analysis involves the creation of a decision
tree to structure the decision problem, determination of outcome
probabilities and utilities (patient values), fold-back analysis to
calculate the expected value of each decision path to determine the
optimal decision-making strategy (Fig. 4.4), and sensitivity analysis to determine the effects of varying outcome probabilities and utilities on decision making (Fig. 4.5). Decision analysis can identify the
optimal decision strategy and how this strategy changes with variations
in outcome probabilities or patient values. This process, whether used
explicitly or implicitly, integrates well with the newer doctor–patient
model of shared decision making.
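The fold-back and sensitivity-analysis steps can be sketched in a few lines of Python. The tree below is deliberately simplified, and all probabilities and utilities are hypothetical; they are not the values used in the Achilles tendon analysis (45), and the function names are illustrative only.

```python
def expected_value(node):
    """Fold back a tree: a number is a terminal utility; a list of
    (probability, child) pairs is a chance node."""
    if isinstance(node, (int, float)):
        return float(node)
    if abs(sum(p for p, _ in node) - 1.0) > 1e-9:
        raise ValueError("branch probabilities at a chance node must sum to 1")
    return sum(p * expected_value(child) for p, child in node)


def operative(p_wound):
    # Hypothetical utilities (0-10): 2.0 with a wound complication, 9.0 without.
    return [(p_wound, 2.0), (1.0 - p_wound, 9.0)]


# Hypothetical nonoperative arm: 20% rerupture (utility 3.0), otherwise 8.0.
NONOPERATIVE = [(0.20, 3.0), (0.80, 8.0)]


def preferred(p_wound):
    """Decision node: choose the strategy with the higher expected value."""
    ev_operative = expected_value(operative(p_wound))
    ev_nonoperative = expected_value(NONOPERATIVE)
    return "operative" if ev_operative >= ev_nonoperative else "nonoperative"


def threshold():
    """One-way sensitivity analysis: the smallest tested probability of wound
    complication (in 1% steps) at which nonoperative treatment is preferred."""
    for percent in range(0, 101):
        p = percent / 100
        if preferred(p) == "nonoperative":
            return p
    return None


print(preferred(0.10), threshold())
```

Varying a single probability while holding the others fixed, as `threshold` does, is a one-way sensitivity analysis; the same fold-back function can be reused to vary utilities instead.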

Figure 4.5 Sensitivity analysis for operative versus nonoperative management of acute Achilles tendon rupture (45).
The probability of wound complication from operative treatment is shown
on the x axis. The lines represent the expected value for the operative
and nonoperative decisions. Above the threshold value (i.e.,
probability of wound complication from operative treatment = 21%),
nonoperative treatment is favored. (From Kocher MS, Bishop J, Luke A,
et al. Operative vs nonoperative management of acute achilles tendon
ruptures: expected-value decision analysis. Am J Sports Med 2002;30:783–790, with permission.)

In the field of medicine, economic evaluation study
designs include cost-identification studies, cost-effectiveness
analysis, cost-benefit analysis, and cost-utility analysis (46,47).

In cost-identification studies, the costs of providing the treatment are identified.

In cost-effectiveness analysis, the costs and clinical outcome are assessed and reported as cost per clinical outcome.

In cost-benefit analysis, both costs and benefits are assessed in monetary units.

In cost-utility analysis, cost and utility are measured and reported as cost per quality-adjusted life-year (QALY).
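As a rough illustration of the cost-utility calculation, the Python sketch below weights time in a health state by its utility to obtain QALYs and then reports cost per QALY. It assumes the conventional utility scale of 0 (death) to 1 (perfect health); the function names and figures are hypothetical.

```python
def qalys(years, utility):
    """Quality-adjusted life-years: years in a health state times its utility."""
    if not 0.0 <= utility <= 1.0:
        raise ValueError("utility is assumed to lie between 0 and 1")
    return years * utility


def cost_per_qaly(cost, qalys_gained):
    """Cost-utility ratio: monetary cost divided by QALYs gained."""
    if qalys_gained <= 0:
        raise ValueError("QALYs gained must be positive")
    return cost / qalys_gained


# Hypothetical intervention: costs 20,000 and yields 10 years at utility 0.8.
gained = qalys(10, 0.8)
print(f"{gained} QALYs at {cost_per_qaly(20000, gained):.0f} per QALY")
```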

BIOSTATISTICS

The scale on which a characteristic is measured has
implications for the way in which information is summarized and
analyzed. Data can be categorical, ordinal, or continuous. Categorical data
indicate types or categories and can be thought of as counts.
Categories do not represent an underlying order. Examples include
gender and a dichotomous (yes/no, success/failure) outcome. Also
called nominal data, categorical data are generally described in terms of proportions or percentages and are reported in tables or bar charts.

If there is an inherent order among categories, then the data are ordinal.
The numbers that are used represent an order but are not necessarily to
scale. Examples include cancer stages and injury grades. Ordinal data
are also generally described in terms of proportions or percentages and
are reported in tables or bar charts.

Continuous data are
observations on a continuum for which the differences between numbers
have meaning on a numerical scale. Examples include age, weight, and
distance. When a numerical observation can take on only integer values,
the scale of measurement is called discrete. Continuous data are generally described in terms of mean and standard deviation and can be reported in tables or graphs.

Data can be summarized in terms of measures of central tendency, such as mean, median, and mode, and in terms of measures of dispersion, such as range, standard deviation, and percentiles.
Data can be characterized by different distributions, such as the
normal (Gaussian) distribution, skewed distributions, and bimodal
distributions (Fig. 4.6).
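The measures of central tendency and dispersion named above can be computed with Python's standard `statistics` module, as in the following sketch. The data set is hypothetical and is deliberately skewed so that the mean and median differ; the function name is illustrative.

```python
import statistics


def summarize(values):
    """Common measures of central tendency and dispersion for a sample."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "standard deviation": statistics.stdev(values),  # sample SD (n - 1)
        "quartiles": statistics.quantiles(values, n=4),  # 25th, 50th, 75th
    }


# Hypothetical right-skewed sample: one large value pulls the mean upward.
sample = [2, 3, 3, 4, 5, 6, 19]
for name, value in summarize(sample).items():
    print(name, value)
```

In a skewed sample such as this one, the median is less affected by the extreme value than the mean is.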

Survivorship analysis is used for analyzing data when
the outcome of interest is time until an event occurs. A group of
patients is monitored to see if they experience the event of interest.
The endpoint in survivorship analysis can be mortality or a clinical
endpoint such as revision of a total joint replacement. Typically,
survivorship data are analyzed using the Kaplan-Meier (KM)
product-limit method and are depicted graphically by KM curves (Fig. 4.7) (48,49,50).
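A minimal Python sketch of the product-limit calculation follows. At each time at which an event occurs, the running survival estimate is multiplied by the proportion of at-risk patients who did not have the event; censored patients leave the at-risk group without changing the estimate. The function name and follow-up data are hypothetical, and confidence intervals are omitted.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimates.

    times:  follow-up time for each patient
    events: True if the event occurred at that time, False if censored
    Returns a list of (time, estimated survival) at each event time.
    """
    if len(times) != len(events):
        raise ValueError("need one event indicator per follow-up time")
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        failures = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # failures plus censored
        if failures:
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up (years) for seven patients; False marks censoring.
follow_up = [2, 3, 3, 5, 8, 8, 10]
had_event = [True, True, False, True, False, True, False]
for t, s in kaplan_meier(follow_up, had_event):
    print(f"t = {t}: S(t) = {s:.3f}")
```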

Univariate or bivariate
analyses assess the relation of an independent variable to a dependent
variable. The commonly used statistical tests and their indications are
listed in Table 4.5. Multivariate analysis explores relations between multiple variables. Regression is a method of obtaining a mathematical relationship between an outcome variable (Y) and an explanatory variable (X) or a set of independent variables (X_i's). Linear regression is used when the outcome variable is continuous, and the goal is to find the line that best predicts Y from X.
Logistic regression, which is used when the outcome variable is binary
or dichotomous, has become the most common form of multivariate
analysis for non–time-related outcomes. Other regression methods
are used for time-to-event data (e.g., Cox proportional-hazards
regression) and count data (e.g., Poisson regression). Regression modeling is commonly used to predict outcomes (Table 4.6)
or to establish independent associations (controlling for confounding
and collinearity) among predictor or explanatory variables. For
example, logistic regression can be used for determining predictors of
septic arthritis versus transient synovitis of the hip in children from
an array of presenting demographic, laboratory, and imaging variables (27).
Similarly, linear regression can be used for arriving at independent
determinants of patient outcome, measured by using a continuous outcome
instrument (36). Because many variables usually influence a particular outcome, it is often necessary to use multivariate analysis.
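The two most common forms of regression can be sketched briefly in Python. The first function fits a least-squares line for a continuous outcome; the second converts a logistic regression model's linear predictor into a predicted probability for a binary outcome. The function names, data, and coefficients are hypothetical, and the coefficients are not those of the published septic arthritis model (27).

```python
import math


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from a single x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predicted_probability(intercept, coefficients, predictors):
    """Logistic model: probability = 1 / (1 + exp(-(b0 + sum of b_i * x_i)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical continuous data lying exactly on the line y = 1 + 2x.
print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))

# Hypothetical coefficients for two yes/no predictors (1 = present, 0 = absent).
print(predicted_probability(-3.0, [2.0, 1.5], [1, 1]))
```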

Figure 4.6 Data distributions.

CASE EXAMPLE: SCOLIOSIS

The management of adolescent idiopathic scoliosis is
illustrative of the impact of the medical literature on the
understanding of a pediatric orthopaedic condition and on the
importance of well-designed clinical research studies.

Figure 4.7
Kaplan-Meier estimated survivorship curves comparing survival rates
between patients who had osteosarcoma with a pathologic fracture and
those without a fracture (51). The estimated
rates were significantly lower for patients with a pathologic fracture
(log-rank test = 5.19; P = 0.02). The error bars around the
survivorship curves represent 95% confidence intervals derived by
the Greenwood formula. The numbers of patients on which the estimates were
based are shown in parentheses.

Hippocrates ascribed chronic scoliosis to poor posture:
“lateral curvatures also occur, the proximate cause of which is the
attitudes in which these patients lie” (52).
This concept of a postural etiology persisted for the next two thousand
years and was supported by the writings of Nicolas Andry in the 1700s,
James Paget in the 1800s, and Robert Lovett in the 1900s (53).
Medical literature describes the various methods of treatment that were
subsequently developed, including different kinds of braces and
appliances, traction devices, gymnastic exercises, subcutaneous
tenotomy, and plaster of paris casts (53). Modern surgical management of scoliosis is based on the results of case series of in situ fusion in 360 patients by Hibbs et al. in 1931 and fusion with instrumentation in 129 patients by Harrington in 1962 (54,55).

An evidence-based approach to the management of
scoliosis was advocated. In 1941, a committee of the American
Orthopaedic Association, headed by Alfred Shands, investigated the
treatment of scoliosis in the United States and reviewed the records of
425 patients (56). This committee concluded
that bracing and exercise programs were effective only in some patients
and that those with progressive deformity
were best treated with correction followed by fusion. John Moe
established the Scoliosis Research Society in 1966. Classical
epidemiologic methods were utilized to establish the incidence of
scoliosis and to study the value of screening programs (57,58). A nomogram was developed for the prediction of curve progression to aid in planning treatment and advising families (59). Case series purported to support the efficacy of the Milwaukee brace and the Boston brace (60,61). The long-term results of natural history, bracing, and surgery were reported (62,63,64,65). The health, functioning, and psychosocial characteristics of patients with idiopathic scoliosis were investigated (66,67).

TABLE 4.5 STATISTICAL TESTS FOR COMPARING INDEPENDENT GROUPS AND PAIRED SAMPLES

Type of Data	Number of Groups	Independent Groups	Paired Samples
Continuous
Normal	2	Student t test	Paired t test
Nonnormal	2	Mann-Whitney U test	Wilcoxon signed rank test
Normal	3 or more	ANOVA	Repeated-measures ANOVA
Nonnormal	3 or more	Kruskal-Wallis test	Friedman test
Ordinal	2	Mann-Whitney U test	Wilcoxon signed rank test
	3 or more	Kruskal-Wallis test	Friedman test
Nominal	2	Fisher exact test	McNemar test
	3 or more	Pearson chi-square test	Cochran Q test
Survival	2 or more	Log-rank test	Conditional logistic regression
ANOVA, analysis of variance.
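For readers who find a lookup easier to use than a table, the rows of Table 4.5 can be encoded as a small Python dictionary. This is only a convenience sketch; the key names and the function name are hypothetical, and the mapping simply restates the table.

```python
# (type of data, "2" or "3+") -> (independent groups test, paired samples test)
TABLE_4_5 = {
    ("continuous-normal", "2"): ("Student t test", "Paired t test"),
    ("continuous-nonnormal", "2"): ("Mann-Whitney U test", "Wilcoxon signed rank test"),
    ("continuous-normal", "3+"): ("ANOVA", "Repeated-measures ANOVA"),
    ("continuous-nonnormal", "3+"): ("Kruskal-Wallis test", "Friedman test"),
    ("ordinal", "2"): ("Mann-Whitney U test", "Wilcoxon signed rank test"),
    ("ordinal", "3+"): ("Kruskal-Wallis test", "Friedman test"),
    ("nominal", "2"): ("Fisher exact test", "McNemar test"),
    ("nominal", "3+"): ("Pearson chi-square test", "Cochran Q test"),
    ("survival", "2"): ("Log-rank test", "Conditional logistic regression"),
    ("survival", "3+"): ("Log-rank test", "Conditional logistic regression"),
}


def choose_test(data_type, n_groups, paired=False):
    """Return the test that Table 4.5 lists for this situation."""
    if n_groups < 2:
        raise ValueError("a comparison needs at least two groups")
    key = (data_type, "2" if n_groups == 2 else "3+")
    if key not in TABLE_4_5:
        raise ValueError(f"unknown type of data: {data_type!r}")
    independent, paired_test = TABLE_4_5[key]
    return paired_test if paired else independent


print(choose_test("continuous-nonnormal", 2))
print(choose_test("nominal", 2, paired=True))
```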

TABLE 4.6 MULTIVARIATE METHODS: LOGISTIC REGRESSION FOR PREDICTION OF SEPTIC ARTHRITIS OF THE HIP IN CHILDREN

History of Fever	Non-Weight Bearing	Erythrocyte Sedimentation Rate ≥40 mm/h	Serum White Blood Cell Count >12,000/mm³	Predicted Probability of Septic Arthritis
Yes	Yes	Yes	Yes	99.8%
Yes	Yes	Yes	No	97.3%
Yes	Yes	No	Yes	95.2%
Yes	Yes	No	No	57.8%
Yes	No	Yes	Yes	95.5%
Yes	No	Yes	No	62.2%
Yes	No	No	Yes	44.8%
Yes	No	No	No	5.3%
No	Yes	Yes	Yes	93.0%
No	Yes	Yes	No	48.0%
No	Yes	No	Yes	33.8%
No	Yes	No	No	3.4%
No	No	Yes	Yes	35.3%
No	No	Yes	No	3.7%
No	No	No	Yes	2.1%
No	No	No	No	1 in 700
From Kocher MS, Zurakowski D, Kasser JR. Differentiating between septic arthritis and transient synovitis of the hip in children: an evidence-based clinical prediction algorithm. J Bone Joint Surg 1999;81A:1662–1670, with permission.

Future clinical research in scoliosis plans to utilize highly rigorous epidemiologic methods to improve the
effectiveness of treatment. RCTs of different spinal instrumentation
systems are under way. A large, multicenter trial on the effectiveness
of bracing for adolescent idiopathic scoliosis has been proposed.
Further inquiry into the effect of scoliosis on health-related quality
of life has also been advocated.

Glossary

Absolute risk reduction (ARR)

Difference in risk of adverse outcomes between experimental and control participants in a trial.

Alpha (type I) error

Error in hypothesis testing where a
significant association is found when there is no true significant
association (rejecting a true null hypothesis). The α level is the
threshold of statistical significance established by the researcher (P
<0.05 by convention).

Analysis of variance (ANOVA)

Statistical test to compare mean values among three or more groups (F test).

Beta (type II) error

Error in hypothesis testing where no
significant association is found when there is a true significant
association (rejecting a true alternative hypothesis).

Bias

Systematic error in the design or conduct of a study. This threatens the validity of the study.

Blinding

Element of study design in which
patients and/or investigators do not know who is in the treatment group
and who is in the control group. The term masking is often used.

Case series

Retrospective observational study
design that describes a series of patients with an outcome of interest
or who have undergone a particular treatment. There is no control group
in a case series.

Case-control study

Retrospective observational study
design that involves identifying cases with the outcome of interest, and
controls without such an outcome, and then looking back to see whether each
group had the exposure of interest.

Categorical data

Variables whose values are categories (nominal variable, qualitative data).

Censored data

In survivorship analysis, an
observation whose outcome is unknown because the patient has not had
the event of interest or is no longer being followed.

Chi-square test

Statistical test to compare proportions or categorical data between groups.

Clinical practice guideline (CPG)

A systematically developed,
evidence-based statement designed to standardize the process of care
and to optimize the outcome of care for specified clinical
circumstances.

Cohort study

Prospective observational study design
that involves identifying group(s) having the exposure or condition of
interest, and then following these group(s) forward for the outcome of
interest.

Collinear

In multivariate analysis, two or more independent variables are not independent of one another.

Conditional probability

Probability that an event will occur, given that another event has occurred.

Confidence interval (CI)

Quantifies the precision of
measurement. It is usually reported as 95% CI, which is the range of
values within which there is a 95% probability that the true value lies.

Confounding

A variable having independent
associations with both the dependent and independent variables, thereby
potentially distorting the relation between the variables.

Construct validity

Psychometric property of an outcome instrument that assesses whether the instrument follows accepted hypotheses (constructs).

Content validity

Psychometric property of an outcome
instrument that assesses whether the instrument is representative of
the characteristic being measured (face validity).

Continuous variable

Variable whose values are numeric on a
continuum scale of equal intervals and are capable of having fractions
(i.e., interval, ratio, numerical, and quantitative data).

Controlling for

Adjusting confounding variables in the design or analysis of a study in order to minimize confounding.

Correlation

A measure of the relation or strength of association between two variables.

Cost-benefit analysis

Economic evaluation of the relation of
financial costs to benefits. Both are measured in monetary units. The
result is reported as a ratio.

Cost-effectiveness analysis

Assesses net costs and clinical outcome. It is reported as a ratio of cost per clinical outcome.

Cost-identification analysis

Assesses only net and component costs of an intervention. It is reported in monetary units.

Cost-utility analysis

Assesses net costs of intervention and
patient-oriented utility of outcomes. It is reported frequently as cost
per quality-adjusted life-year (QALY).

Covariate

An explanatory or confounding variable in a research study.

Criterion validity

Psychometric property of an outcome instrument that assesses its relation to an accepted, “gold-standard” instrument.

Crossover study

Prospective experimental study design
that involves the allocation of two or more experimental treatments,
one after the other in a specified or random order to the same group of
patients.

Cross-sectional study

Observational study design that assesses a defined population at a single point in time for both exposure and outcome (survey).

Decision analysis

Application of explicit, quantitative
methods that analyze the probability and utility of outcomes in order
to analyze a decision under conditions of uncertainty.

Dependent variable

Outcome or response variable.

Descriptive statistics

Statistics such as mean, standard deviation, proportion, or rate that are used for describing a set of data.

Discrete scale

Scale used to measure variables that have integer values.

Distribution

Values and frequency of a variable (e.g., Gaussian, binomial, and skewed).

Effect size

The magnitude of a difference that is
considered to be clinically meaningful. It is used in power analysis to
determine the required sample size.

Evidence-based medicine (EBM)

Conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients.

Experimental study

Study design in which treatment is allocated (i.e., trial).

Factor analysis

Statistical method for analyzing relations among a set of variables to determine underlying dimensions.

Failure

Generic term used for an event.

Fisher exact test

Statistical test used for comparing proportions in studies with small sample sizes.

Hypothesis

A statement that will be accepted or rejected on the basis of the evidence in a study.

Incidence

Proportion of new cases of a specific condition in the population at risk during a specified time interval.

Independent events

Events whose occurrence has no effect on the probability of one another.

Independent variable

Variable that is associated with the
outcome of interest, and that contributes information about the outcome
in addition to that provided by other variables considered
simultaneously.

Intention-to-treat analysis

Method of analysis in randomized
clinical trials in which all patients randomly assigned to a treatment
group are analyzed in that treatment group, whether or not they have
received that treatment or have completed the study.

Interaction

Relation between two independent variables such that they have a different effect on the dependent variable.

Internal consistency

Psychometric property of an outcome instrument that assesses the degree to which individual items are related to each other.

Interobserver reliability

Repeatability of measurements made by two observers.

Intraobserver reliability

Repeatability of measurements made by one observer at two different points in time.

Kaplan-Meier (KM) method

Statistical method in survivorship analysis to estimate survival rates at different times.

Kappa statistic

Statistic used to measure interobserver and intraobserver reliability.

Likelihood ratio (LR)

Likelihood that a given test result
would be expected in a patient with a particular condition compared to
a patient without the condition. It is the ratio of true-positive rate
to false-positive rate.

Log-rank test

Statistic used to compare two survival curves with censored observations.

Longitudinal study

Follows the same patients over multiple points in time.

Matching

Process of making two groups homogeneous for possible confounding factors.

Mean

Measure of central tendency. Sum of values divided by number in sample.

Median

Measure of central tendency. Middle observation (fiftieth percentile).

Meta-analysis

An evidence-based systematic review
that uses quantitative methods to combine the results of several
independent studies to produce statistical summaries.

Mode

Measure of central tendency. The most frequent value.

Multiple comparisons

Pairwise group comparisons involving more than one P value.

Multivariate analysis

Analysis of a set of explanatory
variables with respect to a single outcome, or analysis of several
outcome variables simultaneously with respect to explanatory variables.

Negative predictive value (NPV)

Probability of not having the disease, given a negative diagnostic test. Requires an estimate of prevalence.

Nominal data

Data that are classified into categories with no inherent order.

Nonparametric methods

Statistical tests making no assumption about the distribution of data.

Null hypothesis

Default testing hypothesis assuming no difference between groups.

Number needed to treat (NNT)

Number of patients needed to be treated in order to achieve one additional favorable outcome.

Observational study

Study design in which treatment is not allocated.

Odds

Probability that event will occur divided by probability that event will not occur.

Odds ratio (OR)

Ratio of the odds of having the
condition/outcome in the experimental group to the odds of having the
condition/outcome in the control group (case-control study).

One-tailed test

Test in which the alternative hypothesis specifies a deviation from the null hypothesis in one direction only.

Ordinal variable

Variable that has an underlying order. Numbers used are not to scale.

P value

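Probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true; it is compared with the α level, the accepted probability of type I error. If the P value is small, then it is unlikely that the results observed are due to chance.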

Paired t test

Statistical test used for comparing the difference or change in a continuous variable for paired samples.

Placebo

Inactive substance used to reduce bias by simulating the treatment under investigation.

Positive predictive value (PPV)

Probability of having the disease given a positive diagnostic test. Requires an estimate of prevalence.

Power

Probability of finding a significant
association when it truly exists [1 − probability of type II (β) error].
By convention, power of 80% or greater is considered sufficient.

Prevalence

Proportion of individuals with a disease or characteristic in the study population of interest.

Probability

A number, between 0 and 1, indicating how likely an event is to occur.

Prospective study

Direction of inquiry is forward from cohort. Events transpire after study onset.

Random sample

A sample of subjects from the population such that each has an equal chance of being selected.

Randomized clinical trial (RCT)

Prospective experimental study design
that randomly allocates eligible patients to experimental versus
control groups or different treatment groups.

Receiver-operating characteristic (ROC) curve

Graph showing a test’s performance as the relation between the true-positive rate and the false-positive rate.

Regression

Statistical technique for determining the relation among a set of variables.

Relative risk (RR)

Ratio of incidence of disease or outcome in exposed versus unexposed cohorts (cohort study).

Relative risk reduction (RRR)

Proportional reduction in adverse event rates between experimental and control groups in a trial.

Reliability

Measure of reproducibility of a measurement.

Retrospective study

Direction of inquiry is backwards from cases. Events transpired before study onset.

Robust

A statistical method in which the test statistic is not affected by violation of underlying assumptions.

Sample

Subset of the population.

Selection bias

Systematic error in sampling the population.

Sensitivity

Proportion of patients who have the outcome that are classified as positive.

Sensitivity analysis

Method in decision analysis used to
determine how varying different components of a decision tree or model
change the conclusions.

Skewness

Statistical measure of the asymmetry of the distribution of values for a variable.

Specificity

Proportion of patients without the outcome who are classified as negative.

Standard deviation

Descriptive statistic representing the deviation of individual values from the mean.

Student t test

Statistical test for comparison of means between two independent groups.

Survivorship analysis

Statistical method of analyzing time-to-event data.

Systematic review (SR)

An evidence-based summary of the
medical literature that uses explicit methods to perform a thorough
literature search and critical appraisal of studies.

Test–retest reliability

Psychometric property of consistency of an instrument at different points in time without a change in status.

Two-tailed test

Test in which the alternative hypothesis specifies a deviation from the null hypothesis in either direction.

Univariate analysis

Analysis of the relation of a single independent and a single dependent variable (bivariate analysis).

Utility

Measure of patient desirability or preference for various states of health and illness.

Validity

Degree to which a questionnaire or instrument measures what it is intended to measure.

Wilcoxon rank sum test

Nonparametric version of the Student t test. Also known as the Mann-Whitney U test.

Wilcoxon signed rank test

Nonparametric version of the paired t test for comparing medians between matched groups.
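Several of the measures defined in this glossary follow directly from a 2 × 2 table. The Python sketch below shows one way of computing the measures of effect for a trial and the measures of diagnostic performance for a test; the function names and counts are hypothetical, and the predictive values assume that the sample prevalence reflects the population of interest.

```python
def measures_of_effect(events_exp, n_exp, events_ctl, n_ctl):
    """ARR, RRR, NNT, RR, and OR from adverse-event counts in two groups."""
    risk_exp = events_exp / n_exp
    risk_ctl = events_ctl / n_ctl
    arr = risk_ctl - risk_exp
    return {
        "ARR": arr,
        "RRR": arr / risk_ctl,
        "NNT": 1.0 / arr if arr != 0 else float("inf"),
        "RR": risk_exp / risk_ctl,
        "OR": (risk_exp / (1 - risk_exp)) / (risk_ctl / (1 - risk_ctl)),
    }


def diagnostic_performance(tp, fp, fn, tn):
    """Sensitivity, specificity, predictive values, and positive LR."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        # True-positive rate divided by false-positive rate.
        "LR": sensitivity / (1 - specificity),
    }


# Hypothetical trial: 10 of 100 treated and 20 of 100 control patients
# have the adverse outcome.
print(measures_of_effect(10, 100, 20, 100))

# Hypothetical test: 90 true positives, 20 false positives,
# 10 false negatives, 80 true negatives.
print(diagnostic_performance(90, 20, 10, 80))
```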

REFERENCES

Study Design

1. Hennekens CH, Buring JE. Epidemiology in medicine. Boston: Little Brown, 1987.

2. Oxman
AD, Sackett DL, Guyatt GH. Users’ guides to the medical literature. I.
How to get started. The evidence-based medicine working group. JAMA 1993;270(17):2093–2095.

3. Davidoff F, Haynes B, Sackett D, et al. Evidence based medicine. BMJ 1995;310:1085–1086.

4. Sackett DL, Rosenberg WM. On the need for evidence-based medicine. J Public Health Med 1995;17:330–334.

5. Sackett DL, Rosenberg WM, Gray JA, et al. Evidence based medicine: what it is and what it isn’t. BMJ 1996;312:71–72.

6. Strauss SE, Sackett DL. Using research findings in clinical practice. BMJ 1998;317:339–342.

7. Feinstein AR, Spitz H. The epidemiology of cancer therapy. I. Clinical problems of statistical surveys. Arch Intern Med 1969; 123:171–186.

8. Feinstein
AR, Pritchett JA, Schimpff CR. The epidemiology of cancer therapy. II.
The clinical course: data, decisions, and temporal demarcations. Arch Intern Med 1969;123:323–344.

9. Feinstein AR, Pritchett JA, Schimpff CR. The epidemiology of cancer therapy. III. The management of imperfect data. Arch Intern Med 1969;123:448–461.

10. Wright
JG, Feinstein AR. A comparative contrast of clinimetric and
psychometric methods for constructing indexes and rating scales. J Clin Epidemiol 1992;45:1201–1218.

11. Wennberg J, Gittelsohn A. Small area variations in health care delivery. Science 1973;182(117):1102–1108.

12. Wennberg J, Gittelsohn A. Variations in medical care among small areas. Sci Am 1982;246(4):120–134.

13. Wennberg JE. Dealing with medical practice variations: a proposal for action. Health Aff (Millwood) 1984;3(2):6–32.

14. Wennberg JE. Outcomes research: the art of making the right decision. Internist 1990;31(7):26–28.

15. Wennberg JE. Practice variations: why all the fuss? Internist 1985;26(4):6–8.

16. Wennberg JE, Bunker JP, Barnes B. The need for assessing the outcome of common medical practices. Annu Rev Public Health 1980;1:277–295.

17. Chassin
MR. Does inappropriate use explain geographic variations in the use of
health care services? A study of three procedures [see Comments]. JAMA 1987;258(18):2533–2537.

18. Kahn KL, Kosecoff J, Chassin MR, et al. Measuring the clinical appropriateness of the use of a procedure. Can we do it? Med Care 1988;26(4):415–422.

19. Park
RE, Fink A, Brook RH, et al. Physician ratings of appropriate
indications for three procedures: theoretical indications vs
indications used in practice. Am J Public Health 1989;79(4): 445–447.

20. Millenson ML. Demanding medical excellence. Chicago: University of Chicago Press, 1997.

21. Katz J. The Nuremberg Code and the Nuremberg Trial. JAMA 1996;276:1662–1666.

22. World
Medical Association. Declaration of Helsinki: recommendations guiding
physicians in biomedical research involving human subjects. JAMA 1997;277:925–926.

Diagnostic Performance

23. Kocher
MS, DiCanzio J, Zurakowski D, et al. Diagnostic performance of clinical
examination and selective magnetic resonance imaging in the evaluation
of intra-articular knee disorders in children and adolescents. Am J Sports Med 2001;29:292–296.

24. Kocher MS. Ultrasonographic screening for developmental dysplasia of the hip: an epidemiologic analysis. Part I. Am J Orthop 2000;29:929–933.

25. Kocher MS. Ultrasonographic screening for developmental dysplasia of the hip: an epidemiologic analysis. Part II. Am J Orthop 2001;30:19–24.

26. Baron JA. Uncertainty in Bayes. Med Decis Making 1994;14:46–51.

27. Kocher
MS, Zurakowski D, Kasser JR. Differentiating between septic arthritis
and transient synovitis of the hip in children: an evidence-based
clinical prediction algorithm. J Bone Joint Surg 1999;81A:1662–1670.

28. Hanley JA, McNeil BJ. The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve. Radiology 1982; 143:29–36.

Measures of Effect

29. Kocher
MS, Sterett WI, Briggs KK, et al. Effect of functional bracing on
subsequent knee injury in ACL-deficient professional skiers. J Knee Surg 2003;16:87–92.

Outcomes Assessment

30. Kane RL. Outcome measures. In: Kane R, ed. Understanding health care outcomes research. Gaithersburg, MD: Aspen Publishers, 1997:17–18.

31. Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Med Care 1989;27: 217–232.

32. Stewart AL, Ware JE, eds. Measuring functioning and well-being. Durham: Duke University Press, 1992.

33. Carr-Hill RA. The measurement of patient satisfaction. J Public Health Med 1992;14(3):236–249.

34. Strasser S, Aharony L, Greenberger D. The patient satisfaction process: moving toward a comprehensive model. Med Care Rev 1993;50(2):219–248.

35. Ware JE Jr, Davies-Avery A, Stewart AL. The measurement and meaning of patient satisfaction. Health Med Care Serv Rev 1978; 1(1):1, 3–15.

36. Kocher MS, Steadman JR, Zurakowski D, et al. Determinants of patient satisfaction after anterior cruciate ligament reconstruction. J Bone Joint Surg 2002;84A:1560–1572.

Evidence-based Medicine

37. Sackett DL, Rosenberg WMC, Gray JAM, et al. Evidence-based medicine: what it is and what it isn’t. BMJ 1996;312:71–72.

38. Evidence-Based Medicine Working Group. Evidence-based medicine: a new approach to teaching the practice of medicine. JAMA 1992;268:2420–2425.

39. Sackett DL, Straus SE, Richardson WS, et al. Evidence-based medicine: how to practice and teach EBM. Edinburgh: Churchill-Livingstone, 2000.

40. Bhandari M, Devereaux PJ, Swiontkowski MF, et al. Internal fixation compared with arthroplasty for displaced fractures of the femoral neck. A meta-analysis. J Bone Joint Surg 2003;85A:1673–1681.

41. Kocher MS, Mandiga R, Murphy J, et al. A clinical practice guideline for septic arthritis in children: efficacy on process and outcome for septic arthritis of the hip. J Bone Joint Surg 2003;85A:994–999.

42. Birkmeyer JD, Welch HG. A reader’s guide to surgical decision analysis. J Am Coll Surg 1997;184(6):589–595.

43. Krahn MD, Naglie G, Naimark D, et al. Primer on medical decision analysis: part 4. Analyzing the model and interpreting the results. Med Decis Making 1997;17(2):142–151.

44. Pauker SG, Kassirer JP. Decision analysis. N Engl J Med 1987;316(5):250–258.

45. Kocher MS, Bishop J, Luke A, et al. Operative vs nonoperative management of acute Achilles tendon ruptures: expected-value decision analysis. Am J Sports Med 2002;30:783–790.

46. Detsky AS, Naglie IG. A clinician’s guide to cost-effectiveness analysis. Ann Intern Med 1990;113:147–154.

47. Weinstein MC, Stason WB. Foundations of cost-effectiveness analysis for health and medical practices. N Engl J Med 1977;296(13):716–721.

Biostatistics

48. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc 1958;53:457–481.

49. Kalbfleisch JD, Prentice RL. The statistical analysis of failure time data. New York: John Wiley & Sons, 1980:10–14.

50. Mantel N. Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemother Rep 1966;50:163–170.

51. Scully SP, Ghert MA, Zurakowski D, et al. Pathologic fracture in osteosarcoma: prognostic importance and treatment implications. J Bone Joint Surg 2002;84A:49–57.

Case Example: Scoliosis

52. Adams F. The genuine works of Hippocrates. Baltimore: Williams & Wilkins, 1939:237.

53. Peltier LF. Orthopedics: a history and iconography. San Francisco: Norman Publishing, 1993:195–222.

54. Hibbs RA, Risser JC, Ferguson AB. Scoliosis treated by the fusion operation: an end-result study of three hundred and sixty cases. J Bone Joint Surg 1931;13A:91–104.

55. Harrington PR. Treatment of scoliosis: correction and internal fixation by spine instrumentation. J Bone Joint Surg 1962;44A:591–610.

56. Research Committee of the American Orthopaedic Association. End-result study of the treatment of idiopathic scoliosis. J Bone Joint Surg 1941;23A:962–977.

57. Rogala EJ, Drummond DS, Gurr J. Scoliosis: incidence and natural history. J Bone Joint Surg 1978;60A:173–176.

58. Lonstein JE. Screening for spinal deformities in Minnesota schools. Clin Orthop 1977;126:33–42.

59. Lonstein JE, Carlson JM. The prediction of curve progression in untreated idiopathic scoliosis during growth. J Bone Joint Surg 1984;66A:1061–1071.

60. Blount WP, Schmidt AC, Keever ED, et al. The Milwaukee brace in the operative treatment of scoliosis. J Bone Joint Surg 1958;40A:511–525.

61. Emans JB, Kaelin A, Bancel P, et al. The Boston bracing system for idiopathic scoliosis. Follow-up results in 295 patients. Spine 1986;11:792–801.

62. Ponseti IV, Friedman B. Prognosis in idiopathic scoliosis. J Bone Joint Surg 1950;32A:381–395.

63. Ponseti IV, Friedman B. Changes in scoliotic spines after fusion. J Bone Joint Surg 1950;32A:751–766.

64. Weinstein SL, Zavala DC, Ponseti IV. Idiopathic scoliosis: long-term follow-up and prognosis in untreated patients. J Bone Joint Surg 1981;63A:702–712.

65. Weinstein SL, Ponseti IV. Curve progression in idiopathic scoliosis. J Bone Joint Surg 1983;65A:447–455.

66. Weinstein SL, Dolan LA, Spratt KF, et al. Health and function of patients with untreated idiopathic scoliosis: a 50-year natural history study. JAMA 2003;289:559–567.

67. Noonan KJ, Dolan LA, Jacobson WC, et al. Long-term psychosocial characteristics of patients treated for idiopathic scoliosis. J Pediatr Orthop 1997;17:712–717.