In its most basic sense, metadata is information about data: it describes fundamental characteristics of the data, such as:
Who created the data
What’s in the data file
When the data was generated
Where the data was generated
Why the data was generated
How the data was generated
Metadata makes it easier for you and others to correctly identify and reuse the data at a later date.
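To make this concrete, the answers to those questions can be captured as a simple key-value record. The field names and values below are purely illustrative, not drawn from any particular standard:

```python
# A minimal, hypothetical metadata record covering the basic
# who/what/when/where/why/how questions (all names and values are invented).
metadata = {
    "creator": "A. Researcher",                       # who created the data
    "description": "Hourly temperature readings",     # what's in the data file
    "date_created": "2023-05-14",                     # when it was generated
    "location": "Field station B, plot 3",            # where it was generated
    "purpose": "Baseline for a warming study",        # why it was generated
    "method": "Data logger, 1-hour sampling interval" # how it was generated
}

for field, value in metadata.items():
    print(f"{field}: {value}")
```

Even this informal record answers all six questions, and it can later be mapped onto a formal schema.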
Refined metadata
Structured metadata not only favors the long-term discovery and preservation of your research data, but also enables the simultaneous aggregation and search of research data from tens, hundreds, or thousands of researchers.
Therefore, domain-specific repositories often require highly structured metadata with their data submissions: it allows highly granular searches in their added content. This, in turn, makes your data easier to find.
Experimental data collection
In all likelihood, you’re already capturing the necessary metadata about your research. Your lab notebooks and research files contain much, if not all, of this information, such as:
Name of the researcher
Date
Project
Details of the experiment/analysis being carried out, including the purpose and methods used
Sources of other data used in the experiment/analysis
The key is to collect all the necessary information (metadata) as you work and then link that metadata to the data files themselves.
If you’re the only person using this data, the metadata may not need to be very structured to be useful. However, the metadata should be fairly complete. This will help you later to refer to these files. It will also make the future structuring of your metadata into a formalized standard easier and less complicated.
Metadata tracking
Consider one or more of these methods for tracking metadata and data files:
Keep a paper notebook with information about your projects, noting the locations and names of digital files associated with individual experiments.
Keep a digital notebook with information about your projects with hyperlinks embedded to relevant data files.
Include a note in each data file that indicates the location of the metadata.
In each folder on your computer that contains research data, include a text file that describes the contents of the files in that folder, including explanations of the abbreviations and column headings of the files. You can also include references to publications that describe the data.
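The folder-level text file in the last method can be partly automated. The sketch below is illustrative only; the file name and the template fields are assumptions, not a standard:

```python
from pathlib import Path

# Hypothetical template for a per-folder description file.
TEMPLATE = """Contents of this folder
=======================
Project:
Researcher:
Date range:
File descriptions (one line per file):
Abbreviations / column headings:
Related publications:
"""

def add_description_file(folder, name="ABOUT_THESE_FILES.txt"):
    """Create a template description file in `folder` unless one already exists."""
    target = Path(folder) / name
    if not target.exists():
        target.write_text(TEMPLATE)
    return target
```

Running it once per data folder guarantees that every folder carries at least a skeleton description; it never overwrites a file you have already filled in.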
You may not need your metadata to be heavily structured to understand the contents of your files right now. However, including as much structure as possible can help you better or more quickly understand the data in the future. It will also help other people understand your data without them needing help or explanations directly from you.
Metadata standards
To submit your research to a data repository, you may be asked to format your metadata using a metadata standard. Refer to the repository you are going to use to determine what its metadata requirements are.
Metadata structures are often referred to as “schemas”. A schema defines a set of elements used to describe the data. Completed metadata is typically expressed in a machine-readable format, such as XML.
As an example, the Dublin Core metadata element set contains the following 15 basic properties. Comments and explanations for all of the terms below are available on the Dublin Core website.
Contributor: An entity responsible for making contributions to the resource.
Coverage: The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
Creator: An entity primarily responsible for making the resource.
Date: A point or time period associated with an event in the resource lifecycle.
Description: A description of the resource.
Format: The file format, physical media, or dimensions of the resource.
Identifier: An unambiguous reference to the resource within a given context.
Language: The language of the resource.
Publisher: An entity responsible for making the resource available.
Relation: A related resource.
Rights: Information about rights held in and over the resource.
Source: A related resource from which the described resource is derived.
Subject: The topic of the resource.
Title: Name given to the resource.
Type: The nature or genre of the resource.
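To see what a completed record in such a schema looks like in machine-readable XML, here is a minimal sketch using Python's standard library. The element names follow the Dublin Core element set; the values and the outer `record` wrapper are invented for illustration:

```python
import xml.etree.ElementTree as ET

# The standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Build a tiny record with a few Dublin Core elements (values are invented).
record = ET.Element("record")
for element, value in [
    ("title", "Hourly temperature readings, plot 3"),
    ("creator", "A. Researcher"),
    ("date", "2023-05-14"),
    ("format", "text/csv"),
]:
    child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
    child.text = value

print(ET.tostring(record, encoding="unicode"))
```

The output is a small XML document in which each property appears as a `dc:`-prefixed element, which is the kind of structured, machine-readable metadata repositories can aggregate and search.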
Metadata Standards
Metadata standards or schemas consist of specific elements used to describe or document your data. Some disciplines have established metadata standards. In addition, some data repositories have their own rules. There are also several general purpose schemes that you can tailor to your needs.
If you don’t use a standard metadata schema whose details are widely known and easily accessible to other researchers, be sure to keep the schema itself and its documentation, along with the data and metadata. This will help ensure that you and others can understand and reuse your data in the future.
Examples of metadata standards
Listed below are several known and frequently used metadata standards.
Dublin Core: a general-purpose metadata standard for describing networked resources
Metadata Object Description Schema (MODS): a set of bibliographic elements that can be used for a variety of purposes, particularly library applications. The Metadata Encoding and Transmission Standard (METS) is a related standard that is often used together with MODS
Federal Geographic Data Committee (FGDC) standard: US federal standard for describing geospatial data; ISO 19115 is the corresponding international standard
Encoded Archival Description (EAD): standard for encoding archival finding aids for use in a network environment
Data Documentation Initiative (DDI) Standard: An international XML-based standard for the content, presentation, transport, and preservation of documentation (i.e., metadata) from social and behavioral science datasets.
About ontologies
Ontologies are shared vocabularies used to describe the components of a given discipline and the relationships between those components. Using an ontology makes it easier for other people (or your future self) to understand your data. Controlled vocabularies, by contrast, are simply lists of predefined, authorized terms.
In addition to using a metadata standard, you may want (or be required) to use ontologies or controlled vocabularies when creating your metadata. For example, if you use Dublin Core as your metadata schema, it is recommended that you use the list of Internet Media Types (MIME types), a controlled vocabulary, to fill in the “Format” element. Using a controlled vocabulary for subject terms is also recommended, but the choice of vocabulary is up to you.
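For the “Format” element specifically, Python's standard `mimetypes` module can look up a file's Internet media type, which helps keep entries consistent with that controlled vocabulary:

```python
import mimetypes

# Look up the Internet media (MIME) type for a data file by its name.
media_type, _encoding = mimetypes.guess_type("readings.csv")
print(media_type)  # text/csv
```

The returned string can be copied directly into the “Format” field of a Dublin Core record.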
Below are some examples of ontologies and controlled vocabularies that are currently used in various disciplines:
Bioportal
The ontology portal of the US National Center for Biomedical Ontology (NCBO), hosted at Stanford.
Gene Ontology
A bioinformatics initiative that aims to standardize the representation of gene and gene product attributes across species and databases.
Medical Subject Headings (MeSH)
Controlled vocabulary used to index articles for PubMed.
Web Ontology Language (OWL)
An ontology language for the Semantic Web.
Getty Thesaurus of Geographic Names (TGN)
Controlled vocabulary that includes names and other information about places, administrative political entities, and physical characteristics.
RFC 4646
This specification provides a mechanism for identifying the language of a resource.
Chemical Entities of Biological Interest (ChEBI)
Ontology of small chemical compounds.
MGED Ontology
Ontology from the Microarray Gene Expression Data (MGED) Society, designed to describe microarray experiments.
Internet Media Types
Controlled vocabulary of Internet media (MIME) file types.
Environmental Ontology (EnvO)
Ontology used to describe the environments of any organism or biological sample.
Ontologies of reaction names, chemical methods and molecular processes
Ontologies for chemistry from the Royal Society of Chemistry (RSC)
README file
A README file is a plain text file containing descriptive information, commonly used for software, games, and code. It is a companion document that lets the creator explain the content to the user. When working with data, it can be useful to create and include a README file with the data, so that future users understand the data, the terms used, and so on.
There are no rules for writing a README text file, but it is recommended to include:
Title
Principal Investigator(s)
Dates/places of data collection
Keywords
Language
Funding sources
Descriptions of each folder, file, format, data collection method, instruments, etc.
Definitions
People involved
Recommended citation
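Putting those items together, the start of a README might look like this (a plain-text sketch; every value is invented):

```text
TITLE: Hourly temperature readings, plots 1-3
PRINCIPAL INVESTIGATOR: A. Researcher
DATES/PLACES OF COLLECTION: May-Sep 2023, Field station B
KEYWORDS: temperature, microclimate, monitoring
LANGUAGE: English
FUNDING: Grant XYZ-123 (hypothetical)
FILES:
  readings_plot1.csv - hourly readings, plot 1; columns defined below
DEFINITIONS:
  temp_c - air temperature in degrees Celsius
PEOPLE INVOLVED: A. Researcher, B. Assistant
RECOMMENDED CITATION: Researcher, A. (2023). Hourly temperature readings.
```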
Metadata tools
There are several free tools for creating metadata. Some of them help you select controlled vocabularies to include in your documentation, while others combine that functionality with a fully supported metadata schema. Below you’ll find brief descriptions of several useful tools, along with links to download and installation instructions, documentation, tutorials, and user guides. Refer to the feature comparison table that provides additional information to help you find the right tool for your particular project, platform, and needs.
Annotare
Annotare is a form-based software for annotating biomedical research and the resulting data. It supports biomedical ontologies, contains standard templates for common experiment types, and includes a design wizard to create your own forms.
CEDAR Workbench
CEDAR Workbench is an open-source tool for managing metadata, applying rigorous semantic principles if desired. It allows users to define templates through a user interface (similar to Google Forms or SurveyMonkey survey forms) and then fill out those forms efficiently using drop-down menus, help tips, and smart suggestions. Templates and metadata can be shared with other users and groups. Metadata can also be downloaded as JSON-LD, plain JSON, or RDF, or exported to connected repositories, and everything can be integrated via a full set of APIs.
ISA Creator
ISA Creator is a standalone open-source application that helps plan and describe experiments and makes it easy to export and import data directly to and from some public repositories. Additional tools in the ISA-Tools software suite load ISA-Tab into R data structures and parse ISA-Tab in Perl and Python. ISA-Tab is the format required for publishing data in Nature Publishing’s Scientific Data journal. The software creates separate descriptive files for your experimental files.
Morpho
Morpho allows you to describe ecological experiments and create a catalog of data and descriptions that you can consult. It includes an interface with the Knowledge Network for Biocomplexity (KNB) to share, query, view and retrieve data.
OMERO
OMERO is a repository software for importing, viewing, organizing, describing, analyzing, and sharing microscopy images from anywhere with Internet access. It includes the ability to create groups of users with different permissions to share data.
OntoMaton
OntoMaton provides ontology search and automated tagging through the NCBO BioPortal from within Google Sheets. This tool is part of the ISA-Tools suite. Annotations are created directly in your tabular data file.
RightField
RightField is an open-source tool that lets you search for and select ontology terms from within Microsoft Excel. It allows you to assign a list of allowed values to a specific cell in a worksheet, and all annotations are embedded in the worksheet itself. Users can select ontologies from the NCBO BioPortal or import an ontology from a URL or their local machine.
Storage
Data storage means keeping the data in an easily accessible secondary location. The data is usually mirrored, meaning the copy in the secondary location is identical to the original version.
An example of data storage is the AFS system. Every time you access your AFS storage space, you see exactly the same files and folders that you see on your desktop machine. Accessing AFS is almost as easy as accessing local files, but the data is stored in a physically separate location.
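Whether two copies really are identical can be verified by comparing checksums. A small, system-agnostic sketch (not specific to AFS):

```python
import hashlib

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_mirrored(original, copy):
    """True if the two files have byte-for-byte identical contents."""
    return sha256_of(original) == sha256_of(copy)
```

If the digests match, the mirror copy is byte-for-byte identical to the original.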
Backup
Data backups are typically kept in a separate physical location that may be harder to access than your usual storage space (though not necessarily). A backup is a snapshot of your files at a given point in time, and typically several versions from different points in time are kept.
The Time Machine software on a Mac is a good example of a backup system. It captures exactly what your files contained at a given point in time, and the oldest versions are purged as new ones are created.
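The snapshot-and-purge behaviour described above can be sketched in a few lines of Python. This is a simplified illustration of the idea, not how Time Machine actually works:

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot(source_dir, backup_root, keep=3):
    """Copy source_dir into a timestamped snapshot folder,
    then purge the oldest snapshots beyond the `keep` newest."""
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    # Timestamped folder names sort chronologically.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    shutil.copytree(source_dir, backup_root / stamp)
    snapshots = sorted(p for p in backup_root.iterdir() if p.is_dir())
    for old in snapshots[:-keep]:  # purge the oldest snapshots
        shutil.rmtree(old)
    return backup_root / stamp
```

Each call produces a full copy from a specific point in time, while the retention limit bounds how much disk space the backups consume.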
Some systems, such as AFS, have both storage and backup functions. Check out our list of backup solutions for more options.
Sensitive data
Many researchers work with information about patients’ health or other personal data. These types of data are classified into different categories, each of which requires its own level of security. See our page on sensitive data, which includes more information on data classifications and on the storage and backup of sensitive data.
Conservation
Keep in mind that backing up your data is not the same as and is not a substitute for long-term preservation.