Probabilistic Models for Statistical Data Collection

The goal of any science is to discover the structure and dynamics of the phenomena it studies, and statistical models for data collection are one expression of that goal. Scientists continually try to describe possible structures and ask whether the data, allowing for measurement error, can be adequately described in terms of them.

For a long time, several families of structures have recurred in many fields of science. These structures have become objects of study in their own right, mainly by statisticians, other methodologists, applied mathematicians, and philosophers of logic and science. Methods have been developed to assess how well particular structures account for particular types of data. For the sake of clarity, this article discusses the structures separately from the analytical methods used to estimate and evaluate them. In practice, however, the two are closely intertwined.

Statistical and Mathematical Models

Many mathematical and statistical models try to describe the relationships, both structural and dynamic, that hold among variables assumed to be representable by numbers. Such models are applicable in the social and behavioral sciences only to the extent that appropriate numerical measurements can be devised for the relevant variables. In many studies, however, the phenomena in question and the raw data obtained are not intrinsically numerical but qualitative, such as ethnic group identifications.

The identification numbers used to encode such questionnaire categories for computers are nothing more than labels, which could just as well be letters or colours. A key question is whether there is any natural way to move from the qualitative aspects of such data to a structural representation involving one of the well-understood numerical or geometric models, or whether such an attempt would be inherently inappropriate for the data in question. Deciding whether concrete empirical data can be represented in particular numerical or more complex structures is rarely straightforward. Strong intuitive biases or a priori assumptions about what can and cannot be done can be misleading.

Adaptation to the Social Sciences

In recent decades there has been a rapid and extensive development and application of analytical methods adapted to the nature and complexity of social science data. According to Ellenberg (2014), examples of non-numerical modelling are becoming more numerous. In addition, the widespread availability of powerful computers is probably causing a qualitative revolution, because it affects not only the ability to calculate numerical solutions to numerical models but also the ability to work out the consequences of all kinds of structures that do not involve numbers at all.

It is also useful to distinguish between representations of data that are discrete or categorical by nature (such as whether a person is male or female) and those that are continuous by nature (such as a person's height). Of course, there are intermediate cases involving both types of variables, for example color stimuli characterized by discrete hues (red, green) and a continuous luminance measurement.

Probabilistic models lead very naturally to questions of estimation and of statistical evaluation of the fit between the data and the model. Models that are not probabilistic raise additional problems of handling and representing sources of variability that are not explicitly modeled. Today, scientists understand some aspects of structure, such as geometries, and some aspects of randomness, as embodied in probabilistic models, but they do not yet adequately understand how to put the two together in a single unified model.

Probability models

Some variables in the social and behavioral sciences appear to be more or less continuous, for example the utility of goods, the loudness of sounds, or the risk associated with uncertain alternatives. Many other variables, however, are inherently categorical, often with only two or a few possible values: whether a person is in school or not, whether they are employed or not, whether they identify with a major political party or a political ideology. And some variables, such as moral attitudes, are often measured in research with survey questions that allow only categorical answers.

Much of early probability theory was formulated only for continuous variables. Its use with categorical variables was not really justified, and in some cases may have been misleading. Recently, there have been very significant advances in the way categorical variables are explicitly addressed. We first describe several contemporary approaches to models involving categorical variables, followed by those involving continuous representations.

Log-linear models for categorical variables

Many recent models for the analysis of categorical data of the type typically displayed as counts (cell frequencies) in multidimensional contingency tables fall under the general heading of log-linear models, that is, linear models in the natural logarithms of the expected counts in each table cell. These newly developed forms of statistical analysis make it possible to partition the variability in the distribution of categorical attributes among its various sources and to isolate the effects of particular variables or combinations of variables.
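
As a rough illustration of what "linear in the logarithms of the expected counts" means, the sketch below fits the simplest log-linear model, independence of rows and columns, to a two-way table. The counts are hypothetical and the function names are ours, not taken from any particular package. Under independence the expected count in cell (i, j) is row total times column total divided by the grand total, so its logarithm is a sum of a row term, a column term, and a constant.

```python
import math

def independence_fit(table):
    """Expected counts under the independence log-linear model:
    log m_ij = u + u_row(i) + u_col(j)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def g_squared(table, expected):
    """Likelihood-ratio statistic G^2 = 2 * sum(obs * ln(obs / exp))."""
    return 2.0 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0  # empty cells contribute nothing to the sum
    )

# Hypothetical 2 x 2 table of counts (illustrative numbers only).
observed = [[30, 10], [20, 40]]
expected = independence_fit(observed)
fit_statistic = g_squared(observed, expected)

# On the log scale the fitted interaction contrast is zero under independence.
log_contrast = (math.log(expected[0][0]) - math.log(expected[0][1])
                - math.log(expected[1][0]) + math.log(expected[1][1]))
```

A large value of the fit statistic relative to its reference chi-square distribution (one degree of freedom for a 2 x 2 table) suggests that an interaction term is needed, which is how such models isolate the effect of a combination of variables.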

According to Livio (2013), log-linear models in their current form were first developed and used by statisticians and sociologists and then found wide application in other disciplines of the social and behavioral sciences. When applied, for example, to the analysis of social mobility, these models separate the factors of occupational supply and demand from other factors that impede or drive movement up and down the social hierarchy.

Application of these models

With these models, researchers discovered the surprising fact that occupational mobility patterns are strikingly similar in many nations around the world, even among nations as disparate as the United States and most of the socialist countries of Eastern Europe, and from one period of time to another, once differences in the distribution of occupations are taken into account. Log-linear and related models have also made it possible to identify and analyze systematic differences in mobility between countries and over time. As another example, psychologists and other professionals have used log-linear models to analyze attitudes and their determinants and to link attitudes to behavior. These methods have also been widely disseminated and used in the medical and biological sciences.

Regression models for categorical variables

Models that explain or predict one variable from others, called regression models, are the workhorses of much of applied statistics. This is especially true when the dependent (explained) variable is continuous. For a two-valued dependent variable, such as alive or dead, models with a single explanatory variable, together with approximate theory and computational methods, were developed in biometrics about 50 years ago. Nowadays there are computer programs capable of handling many explanatory variables, continuous or categorical. Even now, however, the accuracy of the approximate theory on a given data set remains an open question.
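
To make the two-valued case concrete, here is a minimal sketch of one common model of this kind, logistic regression with a single explanatory variable, fitted by Newton-Raphson. The text above does not name a specific model, so the logistic form is our assumption, and the data are invented for illustration.

```python
import math

def predict(intercept, slope, x):
    """P(y = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

def fit_logistic(xs, ys, iterations=25):
    """Maximum-likelihood fit of (intercept, slope) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = predict(a, b, x)
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
            h00 += w             # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2 x 2 Newton system by hand.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical data: dose level and a 0/1 outcome (not perfectly separable).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
intercept, slope = fit_logistic(xs, ys)
```

The standard errors usually reported for such a fit rest on large-sample approximations, which is the "approximate theory" whose accuracy on small data sets like this one is in question.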

Using classical utility theory, economists have developed discrete choice models that turn out to be closely related to log-linear and categorical regression models. Models for limited dependent variables, especially those that cannot take values above or below a certain level (such as weeks of unemployment, number of children, and years of schooling), have been used to good effect in economics and in some other areas.

Application to Censored Normal Variables

Censored normal variables (called tobits in economics), in which observed values outside certain limits are simply counted, have been used in the study of decisions to continue in school. Further research and development will be needed to fully incorporate information on limited ranges of variables into the main multivariate methodologies. In addition, with respect to the assumptions conventionally made about distribution and functional form in discrete response models, new methods are now being developed that promise to produce reliable inferences without unrealistic assumptions. Future research in this area promises significant progress.
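
The idea that censored values are "simply counted" shows up directly in the likelihood. The sketch below is a simplified version with a constant mean (no explanatory variables) and censoring from below at a known limit; both simplifications, and the sample values, are our assumptions. Each uncensored observation contributes a normal density, while each censored one contributes only the probability of falling at or below the limit.

```python
import math

def normal_logpdf(z):
    """Log density of the standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(ys, mu, sigma, lower=0.0):
    """Log-likelihood of a normal(mu, sigma) sample censored below at `lower`."""
    total = 0.0
    for y in ys:
        if y <= lower:
            # Censored: we only know the latent value is at or below the limit.
            total += math.log(normal_cdf((lower - mu) / sigma))
        else:
            # Uncensored: ordinary normal density, including the 1/sigma factor.
            total += normal_logpdf((y - mu) / sigma) - math.log(sigma)
    return total

# Hypothetical sample with two observations censored at zero.
sample = [0.0, 0.0, 0.7, 1.4, 2.1]
plausible = tobit_loglik(sample, mu=1.0, sigma=1.0)
implausible = tobit_loglik(sample, mu=5.0, sigma=1.0)
```

Maximizing this function over the mean and standard deviation gives the estimates; a full tobit regression would replace the constant mean with a linear function of explanatory variables.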

Event history models

Event history studies record the sequence of events that respondents experience over a period of time, for example the moment they marry, have a child, or enter the labour force. Event history data can be used to study educational progress, demographic processes (migration, fertility and mortality), company mergers, labour market behaviour and even riots, strikes and revolutions. As interest in this type of data has grown, many researchers have turned to models of how probabilities change over time, especially to describe when and how individuals move among a set of qualitative states.

Many of the advances in models for event history data build on recent developments in statistics and biostatistics for lifetime, failure-time, and hazard models. These models make it possible to analyze qualitative transitions in a population whose members are subject to partially random organic deterioration, mechanical wear, or other risks over time.
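
One standard tool from that failure-time literature is the Kaplan-Meier estimator, which we use here purely as an illustration; the article does not single out a specific method. The sketch assumes each subject has a duration and a flag saying whether the event was observed or the record was cut off (censored) first. The durations are invented.

```python
def kaplan_meier(durations, observed):
    """Return [(time, hazard, survival)] at each distinct observed event time.

    hazard   = events at that time / subjects still at risk
    survival = running product of (1 - hazard)
    """
    event_times = sorted({t for t, d in zip(durations, observed) if d})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for u in durations if u >= t)
        events = sum(1 for u, d in zip(durations, observed) if u == t and d)
        hazard = events / at_risk
        survival *= 1.0 - hazard
        curve.append((t, hazard, survival))
    return curve

# Hypothetical spells in months; False marks a censored spell.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

Censored spells still count in the risk set up to the time they drop out, which is what lets these methods use incomplete histories instead of discarding them.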

The problem of repeated transitions

With the increasing complexity of the event history data currently being collected, and the extension of event history databases to very long periods of time, new problems arise that cannot be dealt with effectively by older types of analysis.

Problems include repeated transitions, such as between unemployment and employment or between marriage and divorce; more than one time variable (such as biological age, calendar time, duration in a stage, and time of exposure to a specific condition); latent variables (variables that are explicitly modeled even though they are not observed); gaps in the data; sample attrition that is not randomly distributed among the categories; and respondents' difficulty in recalling the exact timing of events.

Models for the measurement of multiple items

For a variety of reasons, researchers often use multiple measures (or multiple indicators) to represent theoretical concepts. Sociologists, for example, according to Carlisle (2017), often rely on two or more variables (such as occupation and education) to measure an individual's socioeconomic position. Educational psychologists often measure a student's ability with multiple test items. Although the basic observations are categorical, in several applications they are interpreted as a partition of something continuous. For example, in test theory, measures of item difficulty and of respondent ability are thought of as continuous variables, possibly multidimensional.

Classical test theory and the newer item-response theories in psychometrics deal with the extraction of information from multiple measures. Testing, which is an important source of data in education and other areas, results in millions of test items stored in archives each year, for purposes ranging from university admission to job training programs for industry. One of the goals of research on these test data is to be able to make comparisons between individuals or groups, even when different test items are used.
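
A minimal sketch of how item-response models put people who answered different items on a common scale, assuming the one-parameter (Rasch) model and item difficulties that are already known. Both are our simplifying assumptions, and the numbers are invented.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, iterations=20):
    """Maximum-likelihood ability for 0/1 responses to items of known difficulty.

    Requires at least one correct and one incorrect response;
    otherwise the estimate is unbounded.
    """
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        theta += gradient / information  # Newton-Raphson step
    return theta

# Two hypothetical test takers who saw different items.
ability_a = estimate_ability([1, 1, 0], [-1.0, 0.0, 1.0])
ability_b = estimate_ability([1, 0, 0], [0.5, 1.0, 2.0])
```

Because the estimates are expressed on the same scale as the item difficulties, the two test takers can be compared even though they share no items, provided the difficulties themselves were calibrated on a common scale.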

Response Techniques

Although the information collected from each respondent is intentionally incomplete in order to keep the tests short and simple, item response techniques allow researchers to reconstitute the fragments into an accurate picture of the group's overall competencies. These new methods provide a better theoretical treatment of individual differences and are expected to be extremely important in the development and use of tests. For example, they have been used in attempts to equate different forms of a test administered in successive waves over a year, a procedure made necessary in large-scale testing programs by legislation requiring the disclosure of test scoring keys at the time the results are given.

Practical Example

An example of the use of item-response theory in a major research effort is the National Assessment of Educational Progress (NAEP). The goal of this project is to provide accurate and nationally representative information on the average (and not individual) competence of American children in a wide variety of academic subjects as they progress through elementary and secondary school. This approach is an improvement on the use of trend data from university entrance exams. NAEP’s estimates of academic achievement (by general characteristics such as age, grade, region, ethnicity, etc.) are not distorted by the self-selected nature of students seeking admission to university, graduate, and professional programs.

Item-response theory also forms the basis of many new psychometric instruments known as computerized adaptive tests, which are currently being implemented by the U.S. military services and are under further development in many testing organizations. In adaptive tests, a computer program selects the items for each test taker based on that person's success with the previous items.

Usually, each person receives a slightly different set of items and the equivalence of the scale scores is established using item-response theory. Adaptive testing can greatly reduce the number of items needed to achieve a certain level of measurement accuracy.
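
The selection step can be sketched as follows. The article does not say which rule the program uses; choosing the unused item with the greatest Fisher information at the current ability estimate is one common rule, assumed here together with a Rasch model and a hypothetical three-item bank.

```python
import math

def item_information(ability, difficulty):
    """Fisher information of a Rasch item at the given ability: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)

def select_next_item(ability, item_bank, administered):
    """Pick the not-yet-used item that is most informative at `ability`."""
    candidates = [name for name in item_bank if name not in administered]
    return max(candidates, key=lambda name: item_information(ability, item_bank[name]))

# Hypothetical item bank: item name -> difficulty.
item_bank = {"easy": -1.5, "medium": 0.0, "hard": 1.5}

first_item = select_next_item(0.2, item_bank, administered=set())
second_item = select_next_item(0.2, item_bank, administered={first_item})
```

Under the Rasch model an item is most informative when its difficulty is close to the test taker's ability, so items that are far too easy or far too hard are skipped, which is why fewer items are needed for a given accuracy.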

Nonlinear and non-additive models

Virtually all statistical models currently in use impose an assumption of linearity or additivity of some kind, sometimes after a nonlinear transformation of the variables. Imposing these forms on relationships that do not, in fact, possess them can lead to false descriptions and spurious effects. Unsuspecting users, especially of computer software packages, can easily be misled. But more realistic nonlinear, non-additive multivariate models are becoming available. Extensive use with empirical data is likely to require many changes and improvements to such models. It may also stimulate very different approaches to nonlinear multivariate analysis over the next decade.
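
A small, deliberately artificial example of such a false description: the data below follow an exact quadratic relationship, yet an ordinary least-squares line fitted to them reports no relationship at all. The data and function name are ours, chosen only to illustrate the point.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# y depends on x exactly (y = x squared), but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
intercept, slope = least_squares_line(xs, ys)
```

The fitted slope is zero, so a user who looked only at the linear fit would conclude that x has no effect on y, even though y is completely determined by x.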


Bibliographic References

Livio M (2013) Brilliant Blunders: From Darwin to Einstein—Colossal Mistakes by Great Scientists that Changed our Understanding of Life and the Universe (Simon & Schuster, New York), 1st Simon & Schuster hardcover ed, p 341.

Ellenberg J (2014) How Not to Be Wrong: The Power of Mathematical Thinking (Penguin, New York).

Carlisle JB (2017) Data fabrication and other reasons for non-random sampling in 5087 randomised, controlled trials in anaesthetic and general medical journals. Anaesthesia 72:944–952.
