For many scientists, especially in the life sciences, the process of retrieving literature is tedious, error-prone and opaque. At the same time, the quality of the literature search has been recognized as one of the key factors in generating high-quality scientific evidence. Faced with a growing need to retrieve scientific literature and a wide variety of bibliographic resources, scientists, as users of literature, call for a unified, easy-to-use and reliable entry point to scientific information, such as Google Scholar.
In light of this justified demand, recent results on the quality of Google Scholar as a resource for the retrieval of scientific literature, including for systematic reviews, gave rise to high expectations about the emergence of such a unifying search interface.
Many information specialists seem to be overwhelmed by the rapid evolution of databases with a multitude of technical characteristics. In a continuous process of debate and agreement, the information science and scientific literature retrieval communities have developed methods and standards that aim for high quality in the retrieval and presentation of literature. Many information scientists advocate robust validation of Google Scholar and other tools under development before they are promoted.
Difference between Scientific Literature Database and Scientific Search Engine
The distinction between “scientific literature database” and “scientific search engine” should not be taken too literally, at least from a technological point of view: on the one hand, a competitive literature database uses high-end natural language processing and indexing technology to process its inputs and Internet technology to deliver the results. On the other hand, all indexes generated by a crawler are stored in databases and can be improved with semantic technology.
Thus, in the future, the distinction between “bibliographic database” and “scientific search engine” will become blurred. A “scientific literature database” (e.g. MEDLINE) is only accessible to users via a “search engine” with its user interface (e.g. PubMed or OvidSP). Google Scholar is an academic web search engine that does not provide users with its own resource of reference information. Instead, Google Scholar’s search engine and user interface link directly from its index to documents on the web.
Google Scholar Applicability
Google Scholar’s own documentation makes no claims about the applicability of Google Scholar in particular contexts. It says nothing about the completeness of the coverage or the quality of the retrieval results. The user receives the service “as is”, as clearly indicated in Google’s legal notice. According to its own official statements, Google Scholar tries to cooperate with the editors and producers of scientific texts and provides help on how to prepare papers for indexing by Google Scholar.
However, when resources are closed, when access is restricted (for example, through password protection), or when the owner of the resource does not want to cooperate, Google cannot process the respective documents. Thus, Google Scholar depends on the fundamental accessibility of scientific texts through the Internet or on the willingness of publishers and libraries to cooperate and open their repositories for indexing.
Search expressions in Google Scholar
The general structure of the search expression is the simple conjunction of terms, phrases or subexpressions connected with the Boolean AND. In Google Scholar the AND is expressed as a space ‘ ‘ between terms, phrases or subexpressions.
Terms in Google Scholar are complete individual words (truncation is not possible). Google Scholar applies automatic stemming to terms whose root is recognizable to Google Scholar. However, this mechanism may not be reliable for the language of a specific domain (for example, medical language).
Phrases in Google Scholar are one or more terms separated by spaces in quotation marks ‘”‘. These (connected) phrases are searched by Google Scholar exactly as they are provided to the search interface.
Subexpressions in Google Scholar are disjunctions of terms and phrases connected with a Boolean OR. In Google Scholar the OR is expressed as an “OR” between parts of the search. The subexpression must be enclosed in a pair of parentheses ‘( … )’.
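The rules above (space for AND, “OR” inside one level of parentheses, quotation marks for phrases) can be sketched as a small query builder. This is only an illustration of the syntax described in this section; the function names `quote_if_phrase` and `build_query` are hypothetical and not part of any Google Scholar tool.

```python
def quote_if_phrase(term: str) -> str:
    """Wrap multi-word terms in double quotes so they are treated as exact phrases."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def build_query(groups: list[list[str]]) -> str:
    """Build a conjunction of disjunctions.

    Each inner list becomes one subexpression joined with OR and wrapped in
    parentheses; the subexpressions are joined with a space, which the text
    above describes as the AND operator.
    """
    parts = []
    for group in groups:
        terms = [quote_if_phrase(t) for t in group]
        if len(terms) == 1:
            parts.append(terms[0])  # a single term or phrase needs no parentheses
        else:
            parts.append("(" + " OR ".join(terms) + ")")
    return " ".join(parts)


query = build_query([
    ["oesophagus", "esophagus"],
    ["systematic review"],
    ["child", "children"],
])
print(query)
# (oesophagus OR esophagus) "systematic review" (child OR children)
```

Building the expression from lists keeps every spelling variant visible and avoids nesting parentheses deeper than one level.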
Is Google Scholar ready to be used alone for systematic reviews?
Google Scholar has very high coverage for certain topics in clinical medicine. This is an important precondition for the applicability of Google Scholar as a search engine for systematic reviews. However, these results cannot be generalized to structured search procedures or to all subject areas of biomedical science.
Another important question is how Google Scholar integrates with current professional conduct in the retrieval of scientific literature. For professional work in all fields, scientific search interfaces have certain features and provide at least the following built-in tools:
Reliability and stability of search results in time and place.
Search result set export functions.
A history feature that temporarily stores search results for refinement of search strategies.
Support for the documentation of search strategies.
Advanced user interfaces for composing complex search expressions.
Google Scholar Search Technology
Google Scholar uses Google search engine technology. As such, it is not a bibliographic database in the traditional sense like MEDLINE, Embase or the Web of Knowledge. In a more traditional scientific literature database, entries are collected from selected scientific journals, books, and other resources that meet certain quality criteria. Reference information is extracted and stored in a separate database, for example, the MEDLINE database. In addition, the collected information is indexed automatically and, in part, processed manually by people.
In Google Scholar, an automated software program called a crawler visits academic papers accessible on the Internet and builds a full-text index by storing the words extracted from the full text along with a link to the source document. However, the reference information itself is not accessible through an additional Google Scholar reference database.
Therefore, Google Scholar’s index can only contain references that are accessible over the Internet in some form, for example, as full text, through a publisher’s website, or as a citation in the full text of another work. Consequently, it cannot be guaranteed that a reference retrievable at one point in time will remain retrievable later. Search results change over time as the index changes with the accessibility of source documents or databases.
Google Scholar’s indexing engine implements some natural language processing algorithms to process words collected from sources. In addition, Google Scholar automatically extracts citation information from references. This technology, known as autonomous citation indexing, is also used by the Web of Knowledge and Scopus, albeit with different results.
To provide the user with a meaningful ranking of references, Google’s search engine technology uses ranking algorithms that do more than analyze the match between the search expression and the full text. References are also ranked according to how often they are cited by other references and according to other information. Due to their large size and technological power, Google and Google Scholar are able to index everything that is accessible through the Internet, store it in large distributed databases and offer results in milliseconds.
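The idea of a full-text index that stores words together with a link to the source document can be illustrated with a toy inverted index. This is a minimal sketch of the general technique, not Google Scholar’s actual implementation; the function names and the example.org links are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map every word to the set of document links in which it occurs."""
    index: dict[str, set[str]] = defaultdict(set)
    for link, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(link)
    return index


def search_all(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Return the links of documents that contain every term (a conjunction)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents keyed by hypothetical links.
documents = {
    "https://example.org/a": "Oesophagus bleeding in children",
    "https://example.org/b": "Esophagus surgery outcomes",
}
index = build_index(documents)
print(search_all(index, ["bleeding", "children"]))  # {'https://example.org/a'}
print(search_all(index, ["esophagus"]))             # {'https://example.org/b'}
```

The second query shows why spelling variants matter: a plain word index treats “oesophagus” and “esophagus” as unrelated entries, so only the document using the searched spelling is returned.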
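To make the idea of combining text match with citation frequency concrete, here is a deliberately simple toy scoring function. Google’s real ranking algorithms are not public, so the formula, the `citation_weight` parameter and the sample records are all assumptions chosen purely for illustration.

```python
import math


def score(query_terms: list[str], text: str, cited_by: int,
          citation_weight: float = 0.5) -> float:
    """Toy score: number of term matches plus a damped citation bonus (assumed formula)."""
    words = text.lower().split()
    matches = sum(words.count(term.lower()) for term in query_terms)
    return matches + citation_weight * math.log1p(cited_by)


def rank(query_terms: list[str], records: list[dict]) -> list[dict]:
    """Sort records from highest to lowest toy score."""
    return sorted(
        records,
        key=lambda r: score(query_terms, r["text"], r["cited_by"]),
        reverse=True,
    )


# Hypothetical records: identical text, different citation counts.
records = [
    {"id": "rarely-cited", "text": "esophagus bleeding", "cited_by": 0},
    {"id": "often-cited", "text": "esophagus bleeding", "cited_by": 100},
]
print([r["id"] for r in rank(["esophagus", "bleeding"], records)])
# ['often-cited', 'rarely-cited']
```

With equal text match, the more frequently cited record is listed first, which is the qualitative behaviour the paragraph above describes.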
One of Google’s main goals is easy access and ease of use. This policy may be suitable for many uses, but restricts users to a simple search interface that is not sufficient to express more complex queries. Google Scholar follows Google’s main interface with the simplest possible form of interaction: a single text input field (hereinafter referred to as the “simple search interface”). In addition, there is an “advanced search interface”. This interface allows you to connect search terms with logical operators or use exact phrases in search expressions.
Google Scholar: a search engine for scientific literature with known limitations
When compared to professional literature search interfaces (PubMed, OvidSP, Web of Knowledge) regardless of the underlying data sources, Google Scholar has some important limitations:
The search fields of the simple and advanced search interfaces are limited to expressions that do not exceed a length of 256 characters.
This factor severely impairs the applicability of Google Scholar, as it limits the overall expressiveness of searches to very short expressions. In addition, the search interface silently truncates any expression after 256 characters. Unless the user carefully checks that the intended full expression is actually used, the truncation may leave a meaningless short phrase or term that increases the number of false positive results.
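Since the truncation is not announced, it can help to check an expression before pasting it into the search field. The following is a minimal sketch that takes the 256-character limit from the text above; the function name `check_query_length` is hypothetical.

```python
MAX_QUERY_LENGTH = 256  # limit stated in the text above


def check_query_length(query: str, limit: int = MAX_QUERY_LENGTH) -> tuple[bool, str, str]:
    """Return (fits, kept, dropped) for a search expression.

    `kept` is the part that falls within the limit, `dropped` is the part
    that would be cut off if the interface truncates at `limit` characters.
    """
    return len(query) <= limit, query[:limit], query[limit:]


# An artificially long expression built by repeating one subexpression.
long_query = " ".join(["(oesophagus OR esophagus)"] * 12)
fits, kept, dropped = check_query_length(long_query)
if not fits:
    print(f"Expression has {len(long_query)} characters; "
          f"{len(dropped)} would be cut off, starting at: {dropped[:20]!r}")
```

Looking at the dropped tail shows whether the cut falls inside a term or a subexpression, which is the situation that can produce a meaningless fragment.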
No more than 1000 results from the full result set can be displayed in steps of up to 20 results per page.
Bulk export of results is not available. Results can only be exported to reference management software (e.g. Zotero) one page at a time, that is, at most 20 references per export. With this limit, Google Scholar cannot be integrated into a professional process of selecting references for systematic reviews.
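The practical effect of these limits is simple arithmetic, sketched below with the figures from the text (1000 displayable results, 20 per page). The function names are hypothetical, and the hit count of 545,000 is the result set size quoted later in this article.

```python
import math

MAX_DISPLAYED = 1000  # maximum number of results that can be displayed
PAGE_SIZE = 20        # maximum number of results per page


def pages_to_export(total_hits: int) -> int:
    """Number of page-by-page exports needed to collect every displayable result."""
    return math.ceil(min(total_hits, MAX_DISPLAYED) / PAGE_SIZE)


def unreachable_results(total_hits: int) -> int:
    """Number of results that can never be displayed or exported."""
    return max(0, total_hits - MAX_DISPLAYED)


hits = 545_000
print(pages_to_export(hits))      # 50
print(unreachable_results(hits))  # 544000
```

For a result set of this size, 50 manual exports cover less than 0.2% of the references, which illustrates why the workflow does not scale to systematic reviews.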
Google Scholar has no truncation operators.
Google Scholar search expressions should use full words. An automatic stemming mechanism is used to detect a common word root, but this mechanism does not work reliably. For example, searching for “child” is not enough to find the terms “child”, “childhood” and “children”. The same goes for “randomisation”, “randomization”, “randomized” and “randomised”.
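Because truncation is unavailable and stemming is unreliable, every variant has to be written out explicitly. A minimal sketch of this workaround follows; the helper name `expand_variants` is hypothetical, and the word lists are the ones given above.

```python
def expand_variants(variants: list[str]) -> str:
    """Join explicitly listed word forms into one OR subexpression.

    Duplicates and empty strings are removed while the original order is kept.
    """
    unique = list(dict.fromkeys(v.strip() for v in variants if v.strip()))
    if not unique:
        raise ValueError("at least one word form is required")
    if len(unique) == 1:
        return unique[0]
    return "(" + " OR ".join(unique) + ")"


print(expand_variants(["child", "childhood", "children"]))
# (child OR childhood OR children)
print(expand_variants(["randomisation", "randomization", "randomized", "randomised"]))
# (randomisation OR randomization OR randomized OR randomised)
```

Each listed variant adds to the length of the expression, so long variant lists quickly compete with the 256-character limit described earlier.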
Logical operators can be used, although only without nesting logical subexpressions at more than one level.
It is possible to use conjunctions of terms, phrases, and subexpressions connected with the AND logical operator. Google Scholar uses a space ‘ ‘ to express the logical AND. Subexpressions are disjunctions of terms and phrases connected with the logical OR and must be enclosed in parentheses ( … ) at one level (see an example below). This feature is not documented.
Although the Google Scholar search interface has been improved to interpret logical connectors correctly, retrieval results are still not stable when the order of search terms is varied in otherwise logically equivalent search expressions. A search with the expression oesophagus OR esophagus returns a result set of 545,000 references, whereas the logically equivalent search esophagus OR oesophagus returns 565,000 references.
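The order sensitivity described above can be checked systematically by generating every ordering of a disjunction and comparing the hit counts reported for each. The sketch below only generates the variants and compares counts entered by hand; it does not query Google Scholar. The function names are hypothetical, and the two counts are the figures quoted in the text.

```python
from itertools import permutations


def equivalent_disjunctions(terms: list[str]) -> list[str]:
    """Return every ordering of the terms joined with OR (all logically equivalent)."""
    return [" OR ".join(order) for order in permutations(terms)]


def hit_count_spread(hit_counts: dict[str, int]) -> int:
    """Difference between the largest and smallest reported result set size."""
    return max(hit_counts.values()) - min(hit_counts.values())


variants = equivalent_disjunctions(["oesophagus", "esophagus"])
print(variants)
# ['oesophagus OR esophagus', 'esophagus OR oesophagus']

# Result set sizes quoted in the text, entered manually for each variant.
reported = dict(zip(variants, [545_000, 565_000]))
print(hit_count_spread(reported))  # 20000
```

A spread of zero would indicate stable retrieval; any other value documents that logically equivalent expressions returned different result sets.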
It is not possible to construct all possible expressions in the advanced search interface due to the limited number of input fields available.
Only one field is available for each type of expression (conjunction, disjunction and exact phrase), which is not enough to construct, for example, a simple conjunction of two disjunctions. Example:
(bleeding OR hemorrhage) AND (oesophagus OR esophagus)
Search expressions with more than one subexpression must be built in a text editor outside of Google’s search interface. Once built, they must be copied and pasted in their entirety into the single input field of the simple search interface. In addition, the advanced search interface parses more complex expressions into its fields, although the limited number of fields is not enough to cover the meaning of the search expression (see the example above). As a result, the advanced search interface can distort a query into an expression with completely different semantics. A complex search entered into the simple search interface should therefore never be submitted from the advanced search interface.
The update frequency of Google Scholar may be low for some resources.
The update period for certain resources is up to nine months. Although research results indicate that Google Scholar coverage is very high, the exact coverage is not known. Google itself states that it does not index journals, only articles, and does not claim to be exhaustive.
Literature that is not available in digital format cannot be reliably searched. Only citations of this literature can be found, so it can be searched only by title words and authors.
Some fields in the advanced search interface are not available in a search expression as a keyword or field indicator. While authors can be specifically searched with the field indicator “author” in an expression such as “author: author’s name”, the date is not accessible by a field indicator.