Statistical Data Collection

Statistical data are a sequence of observations made on a set of objects included in a sample drawn from a population. Statistical data can be presented in two ways.

Non-grouped data

Data that have not been arranged in a systematic order are called raw data or non-grouped data.

Grouped data

Data presented in the form of a frequency distribution are called grouped data.
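The step from non-grouped to grouped data can be sketched in a few lines of Python. This is only an illustration: the ages, the class width of 10 and the function name `frequency_distribution` are assumptions made for the example, not values from any real survey.

```python
from collections import Counter

def frequency_distribution(values, class_width, start):
    """Count how many observations fall in each class [lower, lower + width)."""
    counts = Counter((v - start) // class_width for v in values)
    return [
        (start + k * class_width, start + (k + 1) * class_width, counts[k])
        for k in range(min(counts), max(counts) + 1)
    ]

# Hypothetical raw (non-grouped) data: ages of 12 respondents.
raw_ages = [23, 31, 27, 45, 38, 29, 52, 34, 41, 26, 33, 48]

# Grouped data: one (lower, upper, frequency) triple per class.
grouped = frequency_distribution(raw_ages, class_width=10, start=20)
for lower, upper, freq in grouped:
    print(f"{lower} to under {upper}: {freq}")
```

The class limits are a design choice; a different width or starting point gives a different, equally valid grouping of the same raw data.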

Data collection

The first step of any survey (research) is data collection. Data can be collected for the entire population or only for a sample. In most cases, according to Ioannidis (2012), they are collected on the basis of a sample. Data collection is a very difficult job. The enumerator or researcher is the trained person who collects the statistical data. Respondents are the people from whom the information is collected.

Data types

There are two types (sources) of data: primary data and secondary data.

Primary

Primary data are first-hand information that an organization collects, compiles, and publishes for some purpose. They are the most original data and have not undergone any statistical treatment.

Example: Population census reports are primary data because they are collected, compiled and published by the population census organization.

Secondary

Secondary data are second-hand information that has already been collected by an organization for some purpose and is available for the present study. Secondary data are not pure and have been processed at least once.

Example: An economic survey of England is secondary data because its data have been collected by more than one organization, such as the Statistical Office, the Board of Revenue, banks, etc.

Primary data collection methods

Primary data are collected using the following methods:

Personal research: The researcher conducts the survey personally and collects the data directly. Data collected in this way are usually accurate and reliable. This method is only practical for small research projects.
Through researchers: Trained researchers are employed to collect the data. These researchers contact people and fill out the questionnaires after requesting the necessary information. Most organizations use this method.
Collection by questionnaire: Researchers obtain data from local representatives or agents, who answer on the basis of their own experience. This method is fast but only gives a rough estimate.
Over the phone: Researchers get information from individuals over the phone. This method is fast and provides accurate information.

Secondary data collection methods

Secondary data are obtained from the following sources:

Official: e.g. publications of the Statistics Division, the Ministry of Finance, the Federal Statistical Offices, the Ministries of Food, Agriculture, Industry, Labour, etc.
Semi-official: e.g. the State Bank, the Railway Board, the Central Cotton Committee, the Economic Research Boards, etc.
Publications of trade associations, chambers of commerce, etc.
Technical and commercial magazines and newspapers.
Research organizations such as universities and other institutions.

Difference between primary and secondary data

The difference between primary and secondary data is only a change of hands. Primary data are first-hand information collected directly from a source. They are the most original and have not been subjected to any kind of statistical treatment. Secondary data, by contrast, are obtained from other sources or agencies. They are not pure in character and have undergone some treatment at least once.

Example: Suppose we are interested in knowing the average age of students in a certain department. We collect the data by two methods: by directly collecting each student’s information or by obtaining their ages from university records. Data collected through direct personal research are called primary data and data obtained from university records are called secondary data.

Editing

After the data have been collected, whether from primary or secondary sources, the next step is to edit them. Editing means examining the collected data to discover any errors or mistakes before presenting them. It must be decided in advance what degree of precision is desired and what degree of error can be tolerated in the research. Editing secondary data is easier than editing primary data.
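A very simple editing pass might look like the following sketch. The record layout, the function name `edit_records` and the tolerated age range of 15 to 80 are all hypothetical; in practice the tolerance is whatever was agreed before the survey.

```python
def edit_records(records, min_age=15, max_age=80):
    """Split records into accepted ones and ones flagged for manual review."""
    accepted, flagged = [], []
    for rec in records:
        age = rec.get("age")
        # Accept only numeric ages inside the range agreed in advance.
        if isinstance(age, (int, float)) and min_age <= age <= max_age:
            accepted.append(rec)
        else:
            flagged.append(rec)
    return accepted, flagged

# Hypothetical collected records.
records = [
    {"id": 1, "age": 21},
    {"id": 2, "age": 210},   # probably a keying error
    {"id": 3, "age": None},  # question left unanswered
    {"id": 4, "age": 19},
]
accepted, flagged = edit_records(records)
print("accepted:", [r["id"] for r in accepted])
print("flagged:", [r["id"] for r in flagged])
```

Flagged records are set aside rather than deleted, so that the researcher can decide whether to correct them, return to the respondent, or exclude them.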

New statistical techniques

Internal resampling

One of the great contributions of twentieth-century statistics was to show that a properly drawn sample of sufficient size, even if it is only a small fraction of the population of interest, can yield very good estimates of most population characteristics. When enough is known at the outset about the characteristic in question (for example, that its distribution is approximately normal), inference from the sample data to the population as a whole is straightforward, and measures of the certainty of inference, such as the 95 percent confidence interval around an estimate, can be easily calculated.
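As an illustration of how direct the calculation is under a normality assumption, the sketch below computes an approximate 95 percent confidence interval for a mean. The function name and the sample values are assumptions made for the example; the multiplier 1.96 is the standard normal quantile, which is a reasonable approximation for large samples (for small samples a t quantile would normally be used instead).

```python
import math
import statistics

def normal_confidence_interval(sample, z=1.96):
    """Approximate confidence interval for the mean: mean +/- z * standard error."""
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * standard_error, mean + z * standard_error

# Hypothetical sample of measurements.
sample = [10, 12, 14, 16, 18]
low, high = normal_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```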

But the form of the population distribution is sometimes unknown or uncertain, and then the inference procedures cannot be so simple. In addition, according to Perkel J (2012), it is often difficult to assess even the degree of uncertainty associated with complex data and with the statistics needed to unravel complex social and behavioral phenomena.

Internal resampling methods attempt to assess this uncertainty by generating a series of simulated data sets similar to the one actually observed. The definition of "similar" is crucial, and many methods have been devised that exploit different types of similarity. These methods give researchers the freedom to choose scientifically appropriate procedures and to replace procedures that are valid only under assumed distributional forms with others that are not so restricted. The key to these methods is flexible and imaginative computer simulation.

Bootstrap and jackknife methods

For a simple random sample, the "bootstrap" method repeatedly resamples the obtained data (with replacement) to generate a distribution of possible data sets. In this way, one can simulate the distribution of any estimator and derive measures of the certainty of inference. The "jackknife" method repeatedly omits a fraction of the data and in this way generates a distribution of possible data sets that can also be used to estimate variability. These methods can also be used to eliminate or reduce bias. For example, the ratio estimator, a statistic commonly used in the analysis of sample surveys and censuses, is known to be biased, and the jackknife method can usually correct this defect. The methods have been extended to other situations and types of analysis, such as multiple regression.
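Both ideas can be sketched with the Python standard library alone. The example below applies them to a ratio estimator on a small, made-up set of (x, y) pairs; the function names, the data, the number of resamples and the use of a simple percentile interval are all assumptions chosen for illustration, not the only way to carry out either method.

```python
import random
import statistics

def bootstrap_distribution(sample, estimator, n_resamples=2000, seed=0):
    """Resample the data with replacement and re-apply the estimator each time."""
    rng = random.Random(seed)
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(n_resamples)]

def percentile_interval(estimates, level=0.95):
    """Simple percentile interval taken from the simulated estimates."""
    ordered = sorted(estimates)
    alpha = (1 - level) / 2
    lower = ordered[int(alpha * len(ordered))]
    upper = ordered[int((1 - alpha) * len(ordered)) - 1]
    return lower, upper

def jackknife(sample, estimator):
    """Leave-one-out jackknife: bias-corrected estimate and standard error."""
    n = len(sample)
    full = estimator(sample)
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    loo_mean = statistics.fmean(leave_one_out)
    bias = (n - 1) * (loo_mean - full)
    variance = (n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out)
    return full - bias, variance ** 0.5

def ratio(pairs):
    """Ratio estimator: total of y divided by total of x."""
    return sum(y for _, y in pairs) / sum(x for x, _ in pairs)

# Hypothetical (x, y) observations from a simple random sample.
pairs = [(10, 21), (12, 25), (8, 15), (15, 33), (11, 20), (9, 19), (14, 27), (13, 28)]

boot = bootstrap_distribution(pairs, ratio)
low, high = percentile_interval(boot)
corrected, jack_se = jackknife(pairs, ratio)

print(f"ratio estimate:            {ratio(pairs):.4f}")
print(f"bootstrap 95% interval:    ({low:.4f}, {high:.4f})")
print(f"jackknife-corrected ratio: {corrected:.4f} (s.e. {jack_se:.4f})")
```

Because the estimator is passed in as a function, the same two helpers can be reused for a median, a correlation or any other statistic without assuming a particular distributional form.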

There are indications that, under relatively general conditions, these methods and others related to them allow more accurate estimates of the uncertainty of inferences than the traditional methods, which are based on assumed (usually normal) distributions, when that distributional assumption is not justified. For complex samples, this internal resampling or subsampling makes it easier to estimate the sampling variances of complex statistics.

An older and simpler, but equally important, idea is to use one subsample to search the data and develop a model, and at least one separate subsample to estimate and test the selected model. Otherwise, it is almost impossible to allow for the excessively close fit of the model that results from the creative search for the exact characteristics of the sample data, characteristics that are to some extent random and will not predict other samples well.
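A minimal sketch of this split-sample idea follows. The simulated data, the 50/50 split and the choice of a straight-line model are assumptions made purely for illustration; the point is only that the model is fitted on one subsample and its error is judged on the other.

```python
import random

def split_sample(data, fraction=0.5, seed=0):
    """Randomly split the data into an exploration part and a holdout part."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_line(pairs):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mean_squared_error(pairs, intercept, slope):
    """Average squared prediction error of the line on the given pairs."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in pairs) / len(pairs)

# Simulated data (an assumption for the example): y = 2 + 0.5 x + noise.
rng = random.Random(1)
data = [(x, 2 + 0.5 * x + rng.gauss(0, 1)) for x in range(40)]

explore, holdout = split_sample(data)
intercept, slope = fit_line(explore)  # model developed on one subsample
print("error on exploration subsample:", mean_squared_error(explore, intercept, slope))
print("error on holdout subsample:    ", mean_squared_error(holdout, intercept, slope))
```

The holdout error is the more honest figure, because the holdout observations played no part in choosing or fitting the model.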

Robust techniques

Data analysis is based on many technical assumptions. Some, such as the assumption that each element of a sample is drawn independently of the other elements, can be weakened when the data are structured enough to support simple alternative models, such as serial correlation. Typically, these models require the estimation of a few additional parameters. Assumptions about the forms of distributions, with normality being the most common, have proven to be particularly important, and considerable progress has been made in dealing with the consequences of different assumptions.

More recently, robust techniques have been designed that allow clear and valid discrimination among the possible values of central-tendency parameters for a wide variety of alternative distributions by reducing the weight given to occasional extreme deviations. It turns out that by giving up, for example, 10% of the discrimination that could be provided under the unrealistic assumption of normality, performance can be greatly improved in more realistic situations, especially when unusually large deviations are relatively common.
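One of the simplest ways to reduce the weight of extreme deviations is a trimmed mean, shown below next to the ordinary mean and the median. The trimmed mean is used here only as an easy-to-follow stand-in for the broader family of robust estimators; the data and the 10 percent trimming proportion are assumptions for the example.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after discarding the given proportion of values from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return statistics.fmean(ordered[k:len(ordered) - k])

# Hypothetical measurements with one unusually large deviation.
measurements = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 55.0]

print("mean:        ", statistics.fmean(measurements))
print("trimmed mean:", trimmed_mean(measurements))
print("median:      ", statistics.median(measurements))
```

The single extreme value pulls the ordinary mean far from the bulk of the data, while the trimmed mean and the median stay close to 10.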

These valuable modifications of classical statistical techniques have been extended to multiple regression, where iterative reweighting procedures can now offer relatively good performance for a variety of underlying distributional forms. They should be extended to more general analysis schemes.
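The sketch below shows one common form of iterative reweighting for a straight-line fit: at each pass, observations with large residuals receive smaller weights and the line is refitted. The Huber weight function, the tuning constant 1.345, the scale estimate based on the median absolute residual, and the data are all assumptions chosen for the example; it is restricted to one predictor to stay short, whereas the text refers to multiple regression.

```python
import statistics

def weighted_line(xs, ys, ws):
    """Weighted least-squares fit of y = intercept + slope * x."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def huber_irls(xs, ys, c=1.345, n_iter=50):
    """Iteratively reweighted least squares with Huber weights."""
    weights = [1.0] * len(xs)
    for _ in range(n_iter):
        intercept, slope = weighted_line(xs, ys, weights)
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        # Robust scale estimate from the median absolute residual.
        scale = statistics.median([abs(r) for r in residuals]) / 0.6745
        if scale == 0:
            break  # at least half the points are fitted exactly
        # Full weight for small residuals, reduced weight for large ones.
        weights = [
            1.0 if abs(r) <= c * scale else c * scale / abs(r)
            for r in residuals
        ]
    return intercept, slope

# Hypothetical data: y = 1 + 2 x plus small noise, with one gross outlier.
xs = list(range(10))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.0, 41.0]
ys = [1 + 2 * x + e for x, e in zip(xs, noise)]

ols_intercept, ols_slope = weighted_line(xs, ys, [1.0] * len(xs))
robust_intercept, robust_slope = huber_irls(xs, ys)
print(f"least-squares slope: {ols_slope:.3f}")
print(f"reweighted slope:    {robust_slope:.3f}")
```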

In some contexts, especially in the more classical uses of analysis of variance, the use of appropriate robust techniques should help to bring conventional statistical practice closer to the best standards that experts can now achieve.

Interrelated parameters

In trying to provide a more accurate representation of the real world than is possible with simple models, researchers sometimes use models with many parameters, all of which must be estimated from the data. Classical estimation principles, such as straightforward maximum likelihood, do not produce reliable estimates unless the number of observations is much greater than the number of parameters to be estimated or special designs are used in conjunction with sound assumptions. Bayesian methods do not distinguish between fixed and random parameters, which is why they may be particularly suitable for this type of problem.

Recently, various statistical methods have been developed that, according to Salsburg (2017), can be interpreted as treating many of the parameters as random or similar quantities, even if they are considered to represent fixed quantities to be estimated. Theory and practice show that these methods can improve on the simpler fixed-parameter methods from which they have evolved, especially when the number of observations is not large relative to the number of parameters.

Among the most successful applications are admissions to universities and graduate schools, where the quality of the previous school is treated as a random parameter when the data are insufficient to estimate it well separately. Efforts to create appropriate models using this general approach for small-area estimation and for adjusting for undercounting in the census are important potential applications.
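The effect of treating school quality as a random parameter can be illustrated with a simple shrinkage calculation: each school's mean is pulled toward the overall mean, and the pull is stronger when the school contributes few observations. This is only a sketch of the general idea. The grades, the school labels and the two variance components are hypothetical and are taken as known here, whereas in practice they would themselves be estimated.

```python
import statistics

def shrink_group_means(groups, within_var, between_var):
    """Shrink each group mean toward the grand mean.

    The weight on a group's own mean is between_var / (between_var + within_var / n),
    so groups with few observations are pulled more strongly toward the grand mean.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = statistics.fmean(all_values)
    shrunk = {}
    for name, values in groups.items():
        weight = between_var / (between_var + within_var / len(values))
        shrunk[name] = grand_mean + weight * (statistics.fmean(values) - grand_mean)
    return shrunk

# Hypothetical grades of applicants, grouped by previous school.
schools = {
    "A": [3.1, 3.4, 3.0, 3.3, 3.2, 3.5, 3.1, 3.4],
    "B": [3.9],          # a single applicant: little information
    "C": [2.6, 2.9],
}

# Assumed variance components (not estimated in this sketch).
estimates = shrink_group_means(schools, within_var=0.09, between_var=0.04)
for name, value in estimates.items():
    raw = statistics.fmean(schools[name])
    print(f"school {name}: raw mean {raw:.3f}, shrunken estimate {value:.3f}")
```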

Missing data

Serious problems can arise in an analysis when a certain type of information (quantitative or qualitative) is partially or totally missing. Various approaches have been developed or are being developed to address these problems. One of the methods recently developed to deal with certain aspects of missing data is called multiple imputation: each missing value in a data set is replaced by multiple values representing a range of possibilities, with a statistical dependency between the missing values reflected by the linkage between their substitutions.
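A deliberately simplified sketch of multiple imputation for estimating a mean is given below. Each missing value is replaced several times by a draw from a normal distribution fitted to the observed values, the mean is computed on every completed data set, and the results are combined with the usual combining rules (average of the estimates; within-imputation variance plus (1 + 1/m) times the between-imputation variance). The income figures and function name are hypothetical, and the sketch ignores two things a full implementation would handle: uncertainty in the fitted imputation model, and dependence between the missing values.

```python
import random
import statistics

def multiple_imputation_mean(values, m=20, seed=0):
    """Estimate a mean and its standard error from data with missing values (None)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu, sigma = statistics.fmean(observed), statistics.stdev(observed)

    estimates, variances = [], []
    for _ in range(m):
        # Build one completed data set by drawing a value for each gap.
        completed = [v if v is not None else rng.gauss(mu, sigma) for v in values]
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))

    pooled = statistics.fmean(estimates)       # combined point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # variance between imputations
    total_variance = within + (1 + 1 / m) * between
    return pooled, total_variance ** 0.5

# Hypothetical incomes (in thousands) with two non-responses.
incomes = [42.0, None, 38.5, 51.0, None, 47.2, 39.9, 44.1]
estimate, standard_error = multiple_imputation_mean(incomes)
print(f"pooled mean: {estimate:.2f} (s.e. {standard_error:.2f})")
```

The between-imputation term is what distinguishes this from filling each gap once: it carries the extra uncertainty caused by not knowing the missing values into the final standard error.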

Multiple imputation is currently being used to address a major problem of incompatibility between the Census Bureau's 1980 public-use tapes and earlier ones in terms of occupation codes. The extension of these techniques to issues such as non-response to income questions in the Current Population Survey has been examined in exploratory applications, with great promise.

Bibliographic References

Ioannidis JPA (2012) Why science is not necessarily self-correcting. Perspect Psychol Sci 7:645–654.

Salsburg D (2017) Errors, Blunders, and Lies (CRC, Boca Raton, FL).

Perkel J (2012) Should Linus Pauling’s erroneous 1953 model of DNA be retracted? Retraction Watch. Available at retractionwatch.com/2012/06/27/should-linus-paulings-erroneous-1953-model-of-dna-be-retracted/.
