Univariate analysis is the simplest way to analyze data. “Uni” means “one”, that is, your data has only one variable. It does not deal with causes or relationships (unlike regression) and its main objective is to describe; it takes the data, summarizes it, and finds patterns in it.
There are three categories of data analysis: univariate analysis, bivariate analysis, and multivariate analysis.
The variable can be thought of as a category in which data enters. An example of a variable in the univariate analysis could be “age”. Another could be “height.” Univariate analysis would not examine these two variables at the same time, nor would it examine the relationship between them.
Some ways to describe the patterns found in univariate data include looking at the mean, mode, median, range, variance, maximum, minimum, quartiles, and standard deviation. In addition, some ways to display univariate data are frequency distribution tables, bar charts, histograms, frequency polygons, and pie charts.
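As a quick sketch of these summaries, Python's standard `statistics` module can compute most of them. The data below are hypothetical ages invented for illustration, not from the article:

```python
import statistics

ages = [2, 3, 3, 5, 7, 8, 9, 9, 9, 12]  # hypothetical data

mean = statistics.mean(ages)                  # 6.7
median = statistics.median(ages)              # 7.5
mode = statistics.mode(ages)                  # 9 (most frequent value)
value_range = max(ages) - min(ages)           # 10
variance = statistics.variance(ages)          # sample variance
sd = statistics.stdev(ages)                   # sample standard deviation
q1, q2, q3 = statistics.quantiles(ages, n=4)  # quartiles (Python 3.8+)
```

Each of these is a univariate summary: it describes the one variable `ages` without reference to any other variable.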
Bivariate analysis is used to find out if there is a relationship between two different variables. Something as simple as creating a scatter plot by plotting one variable against another in a Cartesian plane (think X and Y axes) can sometimes give you an idea of what the data is trying to tell you. If the data appears to fit a line or curve, then there is a relationship or correlation between the two variables. For example, you can choose to represent caloric intake versus weight.
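To make the caloric-intake-versus-weight idea concrete, a minimal pure-Python sketch of Pearson's correlation coefficient is shown below; the data values are hypothetical, chosen only to illustrate a roughly linear relationship:

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# hypothetical caloric intake (kcal/day) vs. body weight (kg)
calories = [1800, 2000, 2200, 2500, 2800, 3000]
weights = [61, 63, 66, 71, 77, 82]
r = pearson_r(calories, weights)  # close to 1: strong positive linear relationship
```

A value of `r` near +1 or -1 corresponds to points that nearly fit a straight line in the scatter plot; a value near 0 suggests no linear relationship.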
Multivariate analysis is the analysis of three or more variables. There are many ways to perform multivariate analysis, depending on your goals. Some of these methods are:
Canonical correlation analysis, cluster analysis, correspondence analysis, multiple correspondence analysis, factor analysis, and generalized Procrustes analysis
Multiple regression analysis
Partial least squares regression
Principal Component Analysis / Regression / PARAFAC
What is a variable in univariate analysis?
According to Kotz, S.; et al. (2006), a variable in univariate analysis is simply a condition or subset into which the data fall. You can think of it as a “category.” For example, the analysis could look at an “age” variable, or it could look at “height” or “weight.” However, it does not look at more than one variable at a time; otherwise it becomes a bivariate analysis (or, in the case of three or more variables, a multivariate analysis).
A variable is any feature that can be observed or measured in a subject. In clinical studies, a sample of subjects is collected and some variables of interest are considered. Univariate descriptive analysis of a single variable aims to describe the distribution of the variable in a sample and is the first important step of any study.
Authors should identify the type and number of variables examined, as well as the missing data for each variable.
Variables can be categorical or numerical.
Categorical or qualitative data can be binary, nominal or ordinal. Binary variables are characterized by having only two possible categories, for example male/female, dead/alive.
When there are more than two categories/classes, it is important to distinguish between nominal variables, such as blood type, and ordinal variables, such as the stage of the disease.
Categorical data should be presented not only giving percentages for each class, but also absolute frequencies.
Numerical or quantitative data can be roughly divided into discrete and continuous. Discrete variables arise mainly from counts, such as the number of words in a sentence or the number of members of a family, while continuous variables arise mainly from measurements, such as height, blood pressure, or tumor size. Continuous variables can, in principle, take any value within the permissible range of measurement, while discrete variables can take only certain numerical values.
Limitations in Continuous Variables
In the case of continuous variables, the only limitation comes from the accuracy of the measuring instrument. Discrete variables are sometimes treated as continuous when the number of possible values is very large. Numerical variables can be transformed into categorical variables by grouping values into two or more categories, which simplifies the presentation of the results (though not the analysis in general). Categorization of numerical variables results in a loss of information, especially with only two groups, and should be done with caution.
Authors should always specify how the categorization was obtained, in particular how the choice of cut-off points was made, whether on the basis of previous analyses or arbitrarily by the authors (using the median and quartiles, for example). In the absence of prior analysis, theoretical or clinical arguments should justify categorization to avoid bias and obtain reliable results (1).
Researchers should avoid arbitrary cut-off points and should prefer categorization into at least three groups, avoiding dichotomization.
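As a sketch of categorization using data-driven cut-off points rather than arbitrary ones, the snippet below groups a numerical variable at its quartiles, producing four groups instead of a dichotomy. The `categorize` helper and the tumor-size values are hypothetical:

```python
import statistics

def categorize(values, cut_points):
    """Map each value to a group index: group g means the value exceeds exactly g cut points."""
    return [sum(v > c for c in cut_points) for v in values]

tumor_sizes = [1.2, 2.5, 3.1, 0.8, 4.0, 2.2, 5.5, 3.3]  # hypothetical measurements (cm)
q1, q2, q3 = statistics.quantiles(tumor_sizes, n=4)      # quartile cut-off points
groups = categorize(tumor_sizes, [q1, q2, q3])           # four groups, 0 through 3
```

Using the quartiles as cut points is one of the data-based choices the text mentions; any clinically motivated cut points could be passed in the same way.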
Frequency distribution and central tendency
A variable can be described by its frequency distribution, which reports the absolute (or relative to the total) number of times each specific value/class of the variable is observed in the sample. To do this, continuous variables must first be divided into classes. In the case of ordinal and numerical variables, cumulative frequencies can also be calculated. Instead of tables, graphs can be used to describe distributions.
Pie charts, in which each slice represents the proportion of observations in each category, are useful for nominal (unordered) data, while bar charts can be used for ordinal categorical data or for discrete data. Histograms should be used for continuous data.
Another useful possibility is the box-and-whisker plot, which consists of a box spanning the lower and upper quartiles, a center line marking the median, and whiskers representing the extreme centiles, with extreme values displayed beyond the whiskers.
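The components of a box-and-whisker plot can be computed directly. Whisker conventions vary; the article describes whiskers at extreme centiles, while the sketch below uses another common convention (whiskers at the most extreme points within 1.5 × IQR of the box), stated explicitly as an assumption. The data are hypothetical:

```python
import statistics

data = [3, 4, 4, 5, 6, 7, 8, 9, 15]  # hypothetical observations
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
# assumption: 1.5 * IQR whisker convention (other conventions, e.g. extreme centiles, exist)
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
whisker_lo = min(v for v in data if v >= low_fence)
whisker_hi = max(v for v in data if v <= high_fence)
outliers = [v for v in data if v < low_fence or v > high_fence]
```

The box runs from `q1` to `q3` with the line at `median`; `outliers` would be plotted individually beyond the whiskers.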
Due to space limitations, tables reporting summary values for each distribution are usually used to describe the variables considered in a study. Before summarizing a distribution with a few numbers, however, it is always necessary to inspect the complete distribution.
Use of the Mean and Standard Deviation
If the shape of the distribution is approximately symmetric (as with the Gaussian distribution), the mean and standard deviation (SD) can be used, reporting the results as mean (SD) and avoiding the ± notation. If the distribution is skewed, it is better to use the median and quartiles. A general recommendation is to report, in all cases, the mean, median, SD, and quartiles. The mean, median, and mode are very similar for symmetric distributions. For skewed distributions, the median is less influenced by extreme observations.
Another summary measure is the mode, which is the most frequent observation. It is rarely useful for numerical variables, but it is the only measure that can be used with categorical variables. When describing categorical variables in tables, not only the percentages of each class but also their absolute frequencies should always be reported.
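The robustness of the median to extreme observations is easy to demonstrate. The income values below are hypothetical, with one deliberate outlier:

```python
import statistics

incomes = [20, 22, 25, 25, 27, 30, 200]  # hypothetical; one extreme value skews the data

mean = statistics.mean(incomes)      # pulled well upward by the single outlier
median = statistics.median(incomes)  # 25, barely affected by the outlier
mode = statistics.mode(incomes)      # 25, the most frequent observation
```

Here the mean lands far above every typical observation, while the median and mode still describe the bulk of the data, which is why the median is preferred for skewed distributions.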
Do not confuse SD with standard error (SE).
The SE is a measure of the dispersion of sample means around the population mean and is used for inferential (not descriptive) purposes. The SE is the ratio of the SD to the square root of the sample size (n) (2).
The SD is especially useful when the distribution is approximately Gaussian, since in the Gaussian case about 95% of the observations fall within two SDs of the mean (3).
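The SD/SE relationship described above can be written in two lines; the sample values are hypothetical:

```python
import math
import statistics

sample = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.3, 4.7]  # hypothetical measurements

sd = statistics.stdev(sample)      # describes the spread of the observations
se = sd / math.sqrt(len(sample))   # SE = SD / sqrt(n): uncertainty of the sample mean
```

Since `se` shrinks as the sample grows while `sd` estimates a fixed population quantity, reporting SE in place of SD understates the variability of the data, which is the confusion the text warns against.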
The general rule is to present summary statistics with no more than one decimal place beyond the raw data (4). For percentages, one decimal place is usually enough. Rounding should be done only in the final report, not during the analysis, to maintain accuracy and avoid losing information.
According to a commonly used rule, excess digits are dropped if the first excess digit is less than five; if the first excess digit is five or greater, the last retained digit is increased by one. Note that computer output often carries spurious precision that should be rounded back to the original accuracy of the measurements.
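The rounding rule just described is "round half up." A small sketch using Python's `decimal` module implements it; note that Python's built-in `round()` instead rounds halves to the nearest even digit, so it does not follow this rule:

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(x, places):
    """Round per the rule above: drop excess digits if the first is < 5,
    otherwise bump the last retained digit (unlike built-in round())."""
    quantum = Decimal(10) ** -places
    return Decimal(str(x)).quantize(quantum, rounding=ROUND_HALF_UP)

round_half_up("2.344", 2)  # first excess digit is 4 (< 5): drop it  -> 2.34
round_half_up("2.345", 2)  # first excess digit is 5 (>= 5): round up -> 2.35
```

Passing the value as a string avoids binary floating-point surprises (2.345 stored as a float is actually slightly below 2.345), which matters exactly at these halfway cases.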
Time-to-event data
In many studies, the time until the occurrence of an event is of interest. Censored data refer to subjects included in the analysis for whom the event of interest has not yet been observed when the study closes (3). For example, in survival studies, censored data include both patients still alive at the end of follow-up and patients lost during follow-up.
When reporting the number of events, it is advisable to avoid calculating the percentage with respect to the total number of subjects, unless all subjects have been followed for the same time.
The integrity of follow-up is an indicator of the quality of the study. Researchers should therefore report the number of subjects lost to follow-up, in addition to the follow-up interval (minimum and maximum). The Kaplan-Meier method is suitable for describing the distribution of time-to-event data because it correctly accounts for follow-up time and censored observations.
Authors should report graphically the number of subjects at risk, indicate the censoring times and confidence intervals, and state the software used to perform the analyses.
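As a minimal sketch of the Kaplan-Meier product-limit idea (real analyses should use dedicated statistical software, as the text advises), the function below steps the survival estimate down at each observed event time while censored subjects simply leave the risk set. The follow-up times and event indicators are hypothetical:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.
    times: follow-up time per subject; events: 1 = event observed, 0 = censored.
    Returns a list of (time, survival probability) steps."""
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    curve = [(0.0, 1.0)]
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        while i < len(pairs) and pairs[i][0] == t:  # handle tied times together
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:  # survival only drops at event times, not at censoring times
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve
```

For example, `kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])` drops to 0.8 after the first event and correctly removes the subject censored at time 2 from the risk set before the next event.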
A chart or table could include more than one variable. For example, adding a variable such as “Location” or “Age” as a separate column would produce bivariate data, because two variables would then be involved.
How does Univariate Analysis work?
Univariate analysis works by examining a single variable at a time. For example, a frequency distribution table is a form of univariate analysis, since only one variable's frequencies are tabulated. The variable could be age, height, weight, and so on, but as soon as a second variable is introduced, the analysis becomes bivariate; with three or more variables, it becomes multivariate.
Univariate analysis is a common method of understanding data. Another common example of univariate analysis is the mean of a population's distribution. Tables, graphs, frequency polygons, and histograms are popular methods for showing the univariate analysis of a specific variable (e.g., mean, median, mode, standard deviation, range, etc.).
Why univariate statistics?
According to Everitt and Skrondal (2010), univariate analysis explores each variable in a dataset separately. It examines the range of values as well as the central tendency of the values, describes the pattern of responses to the variable, and describes each variable on its own.
Descriptive statistics describes and summarizes the data. Univariate descriptive statistics describe the individual variables.
How to analyze a variable
Get an impression of the raw data for all variables. Raw data resemble a table, with the variable names heading the columns and the information for each case or record displayed in the rows.
Example: Raw data from a study on county worker injuries (top 10 cases)
| Injury Report No. | County Name | Cause of Injury | Severity of Injury |
| --- | --- | --- | --- |
It’s hard to know what happens to each variable in this dataset. Raw data is difficult to understand, especially when there are a large number of cases or records. Univariate descriptive statistics can summarize large amounts of numerical data and reveal patterns in the raw data. To present the information in a more organized format, start with univariate descriptive statistics for each variable.
For example, the severity of injury variable:
| Severity of Injury |
| --- |
Get a frequency distribution of the variable data. This is done by identifying the lowest and highest values of the variable, and then putting all the values of the variable in order from lowest to highest. Then count the number of occurrences of each value of the variable. This is a count of how often each value appears in the dataset. For example, for the variable “Severity of injury”, the values range from 2 to 9.
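The counting step described above is exactly what `collections.Counter` does. The severity ratings below are hypothetical values on the 2-to-9 scale mentioned in the text, not the article's actual data:

```python
from collections import Counter

# hypothetical severity ratings for ten injuries (scale 2-9, as in the text)
severities = [2, 3, 3, 4, 5, 5, 5, 6, 9, 9]

freq = Counter(severities)
for value in sorted(freq):     # values ordered from lowest to highest
    print(value, freq[value])  # each value and its number of occurrences
```

The printed pairs form the two columns of a simple frequency distribution table.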
| Severity of Injury | Number of Injuries with this Severity |
| --- | --- |
Decide whether data should be grouped into classes.
Injury severity ratings can be grouped into a few categories or groups. Grouped data typically have 3 to 7 groups. There should be no groups with a frequency of zero (for example, there are no injuries with a severity rating of 7 or 8).
One way to construct groups is to have equal class intervals (e.g., 1-3, 4-6, 7-9). Another way to construct groups is to have about the same number of observations in each group. Remember that class intervals should be mutually exclusive and exhaustive.
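The equal-class-interval approach can be sketched as a small counting function. The intervals below match the 1-3, 4-6, 7-9 example from the text; the severity values are hypothetical:

```python
def group_frequencies(values, intervals):
    """Count values per class; intervals are inclusive (low, high) bounds that
    must be mutually exclusive and exhaustive."""
    counts = {interval: 0 for interval in intervals}
    for v in values:
        for low, high in intervals:
            if low <= v <= high:
                counts[(low, high)] += 1
                break  # mutually exclusive: each value lands in exactly one class
    return counts

severities = [2, 3, 3, 4, 5, 5, 5, 6, 9, 9]  # hypothetical ratings
grouped = group_frequencies(severities, [(1, 3), (4, 6), (7, 9)])
```

Because the intervals are exhaustive, every value is counted once, so the class frequencies sum to the sample size.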
| Severity of Injury | Number of Injuries with this Severity |
| --- | --- |
Cumulative frequency distributions add a third column to the table (this can be done with simple frequency distributions or with grouped data):
| Severity of Injury | Number of Injuries | Cumulative frequency |
| --- | --- | --- |
A cumulative frequency distribution can answer questions like: how many of the injuries were severity level 5 or lower? Answer = 7
Frequencies can also be presented in the form of percentage distributions and cumulative percentages.
| Severity of Injury | Percent of Injuries | Cumulative percentages |
| --- | --- | --- |
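Cumulative frequencies and the two percentage columns can be derived from the plain frequencies with `itertools.accumulate`. The frequency counts below are hypothetical, chosen only to be consistent with the 2-to-9 scale discussed in the text:

```python
from itertools import accumulate

# hypothetical frequencies per severity level (10 injuries in total)
counts = {2: 1, 3: 2, 4: 1, 5: 3, 6: 1, 9: 2}

total = sum(counts.values())
levels = sorted(counts)
cumulative = list(accumulate(counts[v] for v in levels))      # running totals
percents = [100 * counts[v] / total for v in levels]          # percentage distribution
cum_percents = [100 * c / total for c in cumulative]          # cumulative percentages
```

The last cumulative percentage is always 100%, which is a useful sanity check when building such a table by hand.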
Univariate descriptive statistics
Some ways to describe the patterns found in univariate data are measures of central tendency (mean, mode, and median) and measures of dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range), and standard deviation.
You have several options for displaying univariate data, including frequency distribution tables, bar charts, histograms, frequency polygons, and pie charts.
Kotz, S.; et al., eds. (2006), Encyclopedia of Statistical Sciences, Wiley.
Everitt, B.S.; Skrondal, A. (2010), The Cambridge Dictionary of Statistics, Cambridge University Press.