Errors in The Collection of Statistical Data (I)

Mar 19, 2021 | Methodology

Some aspects of science, taken at the broadest level, are universal in empirical research. These include data collection, analysis and presentation. In each of these aspects, errors can and do occur.

We first discuss the importance of focusing on statistical and data errors in order to continuously improve the practice of science. We then describe the underlying themes of the types of errors and postulate the factors that contribute to them. To do this, we describe a number of cases of relatively serious statistical and data errors, and we review surveys of some types of errors to better characterize their magnitude, frequency and trends. Having examined these errors, we analyze the consequences of specific errors or classes of errors.

Finally, taking into account the themes identified, we discuss methodological, cultural and system-level approaches to reducing the frequency of commonly observed errors. These approaches will plausibly contribute to the self-critical, self-correcting and constantly evolving practice of science and, ultimately, to the advancement of knowledge.

"In common life, to retract an error, even in the beginning, is no easy task. To this we may add that disappointment and opposition inflame the minds of men and attach them still more to their mistakes."

Why it’s important to focus on mistakes

According to Arnup et al. (2016), identifying and correcting errors is essential to science, which gives rise to the maxim that science is self-correcting. The corollary is that if we do not identify and correct mistakes, science cannot claim to be self-correcting, a concept that has been a source of critical discussion. It could also be said that mistakes are necessary for scientific advancement: staying within the limits of established thinking and methods limits the advancement of knowledge.

The history of science is rich in errors. Before Watson and Crick, Linus Pauling published his hypothesis that the structure of DNA was a triple helix. Lord Kelvin misestimated the age of the Earth by more than an order of magnitude. In the early stages of the discipline of genetics, Francis Galton introduced an erroneous mathematical expression for the contributions of different ancestors to an individual's inherited traits. Although erroneous, these ideas represent important insights from some of history's brightest minds working on the frontier between ignorance and knowledge, proposing, testing and refining theories.

Background

Our focus here is on actions that, in principle, well-trained scientists working within their discipline and aware of the established knowledge of their time should or could have known were erroneous or lacking in rigor. While the errors mentioned above could only have been identified in retrospect, thanks to later advances in science, our attention is on errors that could often have been avoided prospectively. Demonstrations of human fallibility, rather than human brilliance, have been and will always be present in science.

For example, almost 100 years ago, Horace Secrist, professor and author of a text on statistical methods, drew substantial conclusions about the performance of companies. These conclusions were based on patterns that a statistical expert of the time should have recognized as regression to the mean. More than 80 years ago, the great statistician Student published a critique of a failed experiment in which the time, effort and expense of studying the effects of milk on the growth of 20,000 children did not yield solid answers, owing to careless design and execution of the study. These problems are not new to science, and similar mistakes are still being made today. Sometimes they are serious enough to call entire studies into question, and they can occur with non-trivial frequency.
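The regression to the mean that Secrist overlooked can be illustrated with a small simulation. The sketch below is not a reconstruction of his data: the number of firms, the noise model and the "top 10%" cutoff are all assumptions chosen for illustration. Each hypothetical firm has a fixed underlying performance, and each year's observed performance adds independent noise to it.

```python
import random
import statistics


def simulate_regression_to_mean(n_firms=5000, seed=0):
    """Return the mean year-1 and year-2 performance of the year-1 top decile.

    Assumed model (illustrative only): observed = stable true level + yearly noise.
    """
    rng = random.Random(seed)
    true_perf = [rng.gauss(0.0, 1.0) for _ in range(n_firms)]
    year1 = [t + rng.gauss(0.0, 1.0) for t in true_perf]
    year2 = [t + rng.gauss(0.0, 1.0) for t in true_perf]

    # Select firms in the top 10% of observed year-1 performance.
    cutoff = sorted(year1)[int(0.9 * n_firms)]
    top = [i for i in range(n_firms) if year1[i] >= cutoff]

    top_year1 = statistics.mean(year1[i] for i in top)
    top_year2 = statistics.mean(year2[i] for i in top)
    return top_year1, top_year2


top_year1, top_year2 = simulate_regression_to_mean()
print(f"top firms, year 1 mean: {top_year1:.2f}")
print(f"same firms, year 2 mean: {top_year2:.2f}")
```

Nothing about the firms changes between the two years in this model, yet the selected group looks worse in year 2 simply because part of its year-1 advantage was noise. Reading that drop as a real decline in performance is the kind of error described above.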

What do we mean by mistakes?

According to Gøtzsche et al. (2007), by mistakes we mean actions or conclusions that are demonstrably and unequivocally incorrect from a logical or epistemological point of view. Examples include logical fallacies, mathematical errors, claims not supported by the data, incorrect statistical procedures, and analysis of the wrong data set. We are not referring to matters of opinion (e.g., whether one measure of anxiety might have been preferable to another) or to ethical issues that are not directly related to the epistemic value of a study (e.g., whether authors had a legitimate right to access data reported in a study).

Finally, by labeling something a mistake, we assert only its lack of objective correctness and imply nothing about the intentions of those who made it. Thus, our definition of invalidating errors can include fabrication and falsification (two types of misconduct). Because these are defined by intentionality and egregiousness, we will not address them specifically. Furthermore, we fully recognize that the categorization of errors requires a degree of subjectivity and is something that others have struggled with.

Types of errors we will consider

According to Tokolahi et al. (2016), the types of errors considered here have three characteristics. First, they relate generally to study design, statistical analysis, and the reporting of designs, analytic choices and results. Second, we focus on invalidating errors: those that involve factual mistakes or deviate substantially from clearly accepted procedures in ways that, if corrected, could alter the conclusions of a work.

Third, we focus on mistakes that the scientist could reasonably be expected to have known, or should have known, to be mistakes. We therefore do not consider the errors of thought or procedure that are necessary for the progress of new ideas and theories. Secrist's errors and those identified by Student could have been avoided using the established knowledge of their time, whereas the mistakes of Pauling, Kelvin and Galton predated the knowledge needed to avoid them.

Violations of Scientific Standards

We believe it is important to isolate scientific errors from violations of scientific standards. These violations are not necessarily invalidating errors. However, they can affect confidence in the scientific enterprise or its operation. Some harmful research practices, non-disclosure of conflicts of interest, plagiarism (which falls under “misconduct”) and failure to obtain ethical approval do not affect the truth or veracity of methods or data.

Rather, they affect prestige (authorship), public perception (disclosure), trust among scientists (plagiarism), and public trust in science (ethical approval). Violations of these standards have the potential to skew conclusions in a field and are therefore important in their own right. Even so, it is important to separate discussions of social misbehavior from errors that directly affect methods, data and conclusions in both primary and secondary analyses.

Underlying themes of errors and their contributing factors

Themes of Error Types

Various themes or taxonomies of errors have been proposed. We have noted errors related to measurement, study design, replication, statistical analysis, analytical choices, citation bias, publication bias, interpretation, and the misuse or neglect of simple mathematics. Others have classified errors by stage of the research process. Bouter et al., for example, classified research misbehaviors into four areas: reporting, collaboration, data collection, and study design. Many elements within the various themes or taxonomies overlap: what one person classifies as research misbehavior, another may classify as a statistical error.

Errors that cause “bad data”

We define bad data as data acquired through collection methods, study designs, or sampling techniques that are erroneous or of sufficiently low quality that their use to address a particular scientific question is scientifically unjustifiable.

Example 1

In one example, self-reported energy intake has been used to estimate actual energy intake. This method involves asking people to recall their dietary intake in one or more ways, and then deriving an estimate of metabolizable energy intake from these reports.

When compared with objective measurements of actual energy intake, the method turns out to be invalid, not just "limited" or "imperfect." The measurement errors are large and non-random enough to produce consistent, statistically significant correlations in the opposite direction to the true correlation for some relationships. Moreover, the relationships between the errors and other factors are numerous and complex enough to defy simple corrections. Concerns were raised about this method decades ago, and yet it is still used.
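How non-random measurement error can reverse the sign of a correlation can be sketched with simulated data. Everything in the example below is hypothetical: the choice of body mass as the related factor, the coefficients, and the assumption that under-reporting grows with body mass are invented for illustration and are not estimates from any study.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


rng = random.Random(1)
n = 2000

# Hypothetical participants: body mass in kg.
mass = [rng.gauss(75.0, 15.0) for _ in range(n)]

# Assumed truth: actual intake (kcal/day) rises with body mass.
true_intake = [1500.0 + 15.0 * m + rng.gauss(0.0, 200.0) for m in mass]

# Assumed reporting error: under-reporting grows with body mass (non-random),
# plus random recall noise.
reported = [
    t - 25.0 * (m - 50.0) + rng.gauss(0.0, 200.0)
    for t, m in zip(true_intake, mass)
]

r_true = pearson(mass, true_intake)
r_reported = pearson(mass, reported)
print(f"correlation with true intake:     {r_true:+.2f}")
print(f"correlation with reported intake: {r_reported:+.2f}")
```

Under these assumptions the reported values point in the opposite direction from the truth. Because the error depends on the very factor being studied, averaging more reports or enlarging the sample does not fix it.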

Example 2

Other common examples of bad data are the confounding of batch effects with the study's variables of interest and the misidentification or contamination of cell lines. In cases of confounding or contamination, the data are bad because of a failed design and are often unrecoverable.
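Why data from a confounded design are often unrecoverable can be shown with a toy, noise-free example. The designs, effect sizes and the `simulate` helper below are hypothetical and serve only to illustrate the point: when every treated sample is processed in one batch and every control in another, a real treatment effect and a pure batch artifact produce exactly the same data.

```python
def simulate(treatment_effect, batch_effect, design):
    """Noise-free outcome per sample: baseline + treatment + batch contribution."""
    baseline = 10.0
    return [baseline + treatment_effect * t + batch_effect * b for t, b in design]


# Each sample is (treated, batch), coded 0 or 1.
# Confounded design: all controls in batch 0, all treated samples in batch 1.
confounded = [(0, 0), (0, 0), (1, 1), (1, 1)]
# Balanced design: each batch contains both treated and control samples.
balanced = [(0, 0), (1, 0), (0, 1), (1, 1)]

# Two different hypothetical "worlds" that an analyst would like to tell apart.
world_a = dict(treatment_effect=2.0, batch_effect=0.0)  # real treatment effect
world_b = dict(treatment_effect=0.0, batch_effect=2.0)  # pure batch artifact

confounded_same = (
    simulate(design=confounded, **world_a) == simulate(design=confounded, **world_b)
)
balanced_same = (
    simulate(design=balanced, **world_a) == simulate(design=balanced, **world_b)
)

print(confounded_same, balanced_same)  # prints: True False
```

In the confounded design the two worlds are indistinguishable even without noise, so no analysis applied afterwards can separate them. In the balanced design the same two worlds yield different data, which is why the problem has to be prevented at the design stage.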

Bad data represent one of the most egregious themes of errors because there is usually no right way to analyze them, and often no scientifically justifiable conclusions can be reached on the original questions of interest. They can also be one of the most difficult errors to classify, because classification can depend on information such as the context in which the data are used and whether they are fit for a particular purpose.


Bibliographic References

Tokolahi E, Hocking C, Kersten P, Vandal AC (2016) Quality and reporting of cluster randomized controlled trials evaluating occupational therapy interventions: A systematic review. OTJR (Thorofare, NJ) 36:14–24.

Arnup SJ, Forbes AB, Kahan BC, Morgan KE, McKenzie JE (2016) The quality of reporting in cluster randomised crossover trials: Proposal for reporting items and an assessment of reporting quality. Trials 17:575.

Gøtzsche PC, Hróbjartsson A, Marić K, Tendal B (2007) Data extraction errors in meta-analyses that use standardized mean differences. JAMA 298:430–437.
