Beginner’s Guide to Evaluating Scientific Studies
While writing our report on puberty suppression, Lesbians United consulted a vast body of scientific research. Here’s how we evaluated the quality of the studies we cited, and how you can do the same.
Four Types of Studies
Puberty Suppression: Medicine or Malpractice? cites four types of scientific study:
Clinical studies, in which researchers administer an experimental treatment to their subjects, and record the results.
Retrospective studies, in which researchers examine existing medical records.
Case studies, or reports on the case of one patient or a handful of patients.
Voluntary-response surveys, in which participants volunteer to answer survey questions.
If set up well, both clinical and retrospective studies can tell us something about the likelihood of particular outcomes—like how effective the experimental drug is, or how prevalent its side effects are. Case studies and voluntary-response surveys can’t do this. Instead, they provide anecdotal evidence—evidence that some people might respond a certain way to the drug in question, or that a certain side effect might sometimes occur.
Case studies do often represent the first published evidence of a widespread problem—one doctor notices something unexpected and writes it up, several other doctors follow suit, and then researchers conduct larger studies to find out how prevalent the problem is.
Voluntary-response surveys are considered the least reliable kind of study, since their voluntary nature makes them prone to bias, and the responses they receive are always subjective.
The Good and the Gold Standard
The gold standard for clinical study design is the randomized, double-blind, placebo-controlled study. In this kind of study, a “treatment group” receives the experimental drug, and a “control group” receives a placebo (a sugar pill or saline injection). Neither the researchers nor the subjects know who received the real drug until all data have been collected. This method counters researcher bias and prevents emotional factors (optimism if you know you’re receiving the real drug, pessimism if you know you’re not receiving treatment) from affecting patient outcomes.
Other factors that make a scientific study more trustworthy include:
a large sample size. The more subjects a study includes, the more likely it is to accurately predict outcomes for the general population. This is one reason case studies are not predictive—they only study one or a few subjects at a time. This is also why retrospective studies are often so valuable. By examining whole databases of medical records, they can easily have sample sizes over 50,000!
a control group. Both clinical and retrospective studies require a control group to be even remotely useful. Otherwise, we can’t know whether outcomes are actually a result of the experimental treatment, or due to some other factor, like natural healing or the progression of time. For best results, the control group should be as similar as possible (in age, geographic location, and medical history) to the treatment group.
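The sample-size point above can be made concrete with a standard statistical rule of thumb: the margin of error on a measured rate shrinks roughly with the square root of the number of subjects. This is a minimal sketch using the normal approximation for a proportion; the function name and the sample sizes are illustrative, not drawn from any particular study.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for an observed proportion p
    measured in a sample of n subjects (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# A side-effect rate measured in 100 subjects is far less precise
# than one measured across a 50,000-record database:
print(f"n=100:    +/- {margin_of_error(100):.1%}")     # roughly +/- 9.8%
print(f"n=50000:  +/- {margin_of_error(50_000):.1%}")  # roughly +/- 0.4%
```

In other words, a retrospective study with tens of thousands of records can pin down a prevalence estimate to within a fraction of a percentage point, while a small clinical study can only bracket it loosely.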
The Bad: Red Flags
Red flags in clinical and retrospective studies include small sample size, lack of control group, and:
large loss to follow-up. Loss to follow-up means that some subjects dropped out of the study between the first and last data measurement. While some loss to follow-up is inevitable in large studies, anything over 15% could significantly skew the data.
poor exclusion criteria. Exclusion criteria are the characteristics that disqualify subjects from participating in a study. Watch out for studies whose exclusion criteria seem self-contradictory (like mentally ill subjects being excluded from a study on mental illness) or overly strict (like a study that only includes subjects who have had a specific kind of biopsy, when biopsy isn’t necessary to diagnose the disease being studied).
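The loss-to-follow-up arithmetic above is simple enough to check yourself when a study reports its enrollment and completion numbers. Here is a minimal Python sketch; the function name and the enrollment figures are hypothetical, and the ~15% threshold is the rule of thumb mentioned above, not a hard statistical law.

```python
def loss_to_followup_pct(enrolled, completed):
    """Percentage of subjects lost between the first and last data measurement."""
    lost = enrolled - completed
    return 100 * lost / enrolled

# Hypothetical example: 240 subjects enrolled, 190 completed the study.
rate = loss_to_followup_pct(240, 190)
print(f"{rate:.1f}% lost to follow-up")  # 20.8% -- above the ~15% rule of thumb
```

If the dropouts left for reasons related to the treatment (for example, subjects who suffered side effects stopped showing up), the remaining data will paint a rosier picture than reality.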
Red flags in any type of study include:
politically biased language. Scientific studies usually conclude by suggesting directions for future research—not by attacking specific legislation or by calling for equality, justice, or access.
evidence at odds with conclusion. If the raw data shows one thing, and the researchers conclude something completely different, try to figure out how the researchers arrived at that conclusion. There are sometimes good reasons for the discrepancy, but it may also indicate that researchers “fudged” their analysis.
biased survey questions. Look carefully at the wording of the questions in any study that relies wholly or partly on a survey.
The Ugly (or, “How did this get through peer review?”)
In our research for Puberty Suppression: Medicine or Malpractice?, we came across a few studies that we honestly couldn’t believe got published. Shout-out to the following hot unscientific messes, and the journals that published them:
Olson-Kennedy et al. 2019. Johanna Olson-Kennedy runs a lucrative clinic that prescribes “puberty blockers” to kids, so it’s no surprise that the study she spearheaded concludes that these drugs are great for kids. How did the authors arrive at this conclusion? By not bothering to include a control group, losing 19% of their subjects to follow-up, and openly stating that the purpose of the study is to “substantially expand treatment across the country.” Oh, and the exclusion criteria include “presence of serious psychiatric symptoms … or appearing visibly distraught.” (We’re supposed to believe that “gender dysphoria” is a serious psychiatric condition that causes extreme distress, right?) Both Olson-Kennedy and JMIR Research Protocols have a lot to answer for.
Turban, Beckwith, Reisner, and Keuroghlian 2020 and Turban, King, Carswell, and Keuroghlian 2020. Two studies headed up by “gender doctor” Jack Turban, one published in JAMA Psychiatry and one in Pediatrics. Both take their data from a politically motivated voluntary-response survey—a fact which neither paper acknowledges.
Achille et al. 2020. This study lost a whopping 45% of its subjects to follow-up. Instead of acknowledging that they had a problem, the researchers tried to bury the evidence by retroactively removing all the data from the subjects who were lost to follow-up. Somehow, the International Journal of Pediatric Endocrinology decided that it was a good idea to publish this paper anyway.
Tordoff et al. 2022. In addition to losing 37.5% of its subjects to follow-up, this study takes the cake for politically biased language, targeting “antitransgender legislation” and begging “for medical systems and insurance providers to decrease barriers and expand access to gender-affirming care.” By publishing this paper, JAMA isn’t furthering scientific research; it’s signaling its compliance with a political ideology.
For more information on the effects of puberty suppression, check out Lesbians United’s free publication Puberty Suppression: Medicine or Malpractice?, available now on our website.