8  Planning the Search

8.1 Query Crafting

Running your query is the first operational step of your systematic review: it’s often one of the first things you do after you have publicly frozen your preregistration. In that sense it’s kind of exciting, but ironically the results you obtain will normally not be surprising, since you repeatedly test your query while crafting it.

8.2 Queries as logical expressions

A query is a logical expression that specifies the conditions that must be met for bibliographic records to be returned by the interfaces that you use to search the bibliographic databases (see below). You first craft this query in a conceptual form, not worrying about the syntax that you will have to use to specify your query in a way the different interfaces can parse.

The simplest queries typically bind together sets of synonyms using the logical conjunction operator (often represented by AND, &, or &&). Each set of synonyms binds together various terms using the logical disjunction operator (often represented by OR, |, or ||). For example, imagine we’re doing a systematic review on the determinants of ecstasy use. In that case, a simple query could be:

("determinant" OR "determinants" OR "correlate" OR "correlates") AND ("ecstasy" OR "XTC" OR "MDMA")

This query has two terms. We could label the first “determinants”, since it is intended to capture all synonyms for “determinants” (it does so badly, since I wanted to keep this example short; such lists of synonyms are normally much longer), and we could label the second “ecstasy”, since its task is to match all records that contain a word for “ecstasy” (again, doing so badly to enable a brief example).
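As an illustration, a conceptual query like this one can be assembled from plain lists of synonyms. The following is a minimal sketch; the helper names `or_group` and `and_query` are made up for this example and do not belong to any interface or library.

```python
def or_group(terms):
    """Join quoted synonyms with OR and wrap the group in parentheses."""
    quoted = ['"{}"'.format(term) for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def and_query(groups):
    """Join several synonym groups with AND."""
    return " AND ".join(or_group(terms) for terms in groups)


# Synonym sets from the example above (deliberately short).
determinants = ["determinant", "determinants", "correlate", "correlates"]
ecstasy = ["ecstasy", "XTC", "MDMA"]

conceptual_query = and_query([determinants, ecstasy])
print(conceptual_query)
```

Keeping the synonym sets as lists makes it easy to add a synonym later and regenerate the whole query, instead of editing a long string by hand.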

Using these two logical operators, it’s also possible to build more complex queries. For example, if we knew enough about substance use to realise that the determinants of trying out ecstasy (i.e. “initiation” of ecstasy use) differ from the determinants of “using ecstasy”, the second term would become more complex:

("determinant" OR "determinants" OR "correlate" OR "correlates") AND (("ecstasy" OR "XTC" OR "MDMA") AND (("using" OR "use") OR ("trying out" OR "initiating")))
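To make the idea of a query as a nested logical expression concrete, here is a small sketch that evaluates this query against a piece of text. The nested-tuple representation and the `matches` function are assumptions made for this example, and the two titles are invented. Real interfaces apply their own matching rules (stemming, phrase handling, field restrictions), so treat this as a way to reason about the logic, not as a model of any database.

```python
import re


def matches(query, text):
    """Return True if `text` satisfies the nested AND/OR `query`."""
    if isinstance(query, str):
        # A term matches if it occurs as a whole word or phrase (case-insensitive).
        pattern = r"\b" + re.escape(query) + r"\b"
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    operator, operands = query
    results = [matches(operand, text) for operand in operands]
    return all(results) if operator == "AND" else any(results)


# The query above, written as (operator, [operands]) tuples.
query = ("AND", [
    ("OR", ["determinant", "determinants", "correlate", "correlates"]),
    ("AND", [
        ("OR", ["ecstasy", "XTC", "MDMA"]),
        ("OR", [
            ("OR", ["using", "use"]),
            ("OR", ["trying out", "initiating"]),
        ]),
    ]),
])

# Two invented titles, for illustration only.
print(matches(query, "Determinants of initiating ecstasy use among students"))  # True
print(matches(query, "Correlates of alcohol use among students"))  # False
```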

The complexity of the query you end up with is often related to the complexity or “subtlety” of your goal or research question. If you’re conducting a scoping review, where you aim to map out the literature on a specific topic, you will generally have simpler queries than when you have a specific narrow research question.

Query complexity is often also related to the richness of the literature. If the literature on a topic is very extensive, the review may become unfeasible if you use a very simple, broad query: you might obtain tens of thousands of hits without the resources to screen all of those. Conversely, if you’re surveying a smaller literature, you can afford to have a less sophisticated query. Since screening costs a lot of time, it usually pays off to spend a lot of time developing your query so that you minimize the number of irrelevant hits.

8.3 Database fields

In addition to the terms themselves, you can specify the fields you want to search. For example, you can search all text fields (the default in most interfaces if you don’t specify one or more fields), or only the title field, or the title and the abstract and the keyword fields, et cetera. Usually you will want to search the titles at the bare minimum; and unless you are confident of relatively standardized vocabulary in a field, you’ll often want to add the abstract and keyword fields. Including fields like journal name, authors, or affiliations usually doesn’t make sense, so it is very rare to omit explicit field names and search all fields.

Sometimes interfaces will allow you to specify multiple fields in a query, for example, indicating that a search term (e.g. a set of synonyms) can occur in either the title or the abstract; but often that’s not possible, and you have to duplicate parts of your query. This can cause queries to grow very quickly, and this is one of the reasons why it is important to craft your query on the conceptual level before starting the translation into the interface languages.
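The duplication described above can be sketched as follows. The `FIELD(...)` notation is a placeholder syntax, and the field codes `AB` and `KW` are assumptions for this example; check the documentation of your own interface for the actual field names and syntax.

```python
def expand_over_fields(terms, fields):
    """Repeat a synonym group once per field and join the copies with OR.

    The FIELD("term" OR ...) notation is a hypothetical placeholder syntax.
    """
    group = " OR ".join('"{}"'.format(term) for term in terms)
    per_field = ["{}({})".format(field, group) for field in fields]
    return "(" + " OR ".join(per_field) + ")"


ecstasy = ["ecstasy", "XTC", "MDMA"]
fields = ["TI", "AB", "KW"]  # assumed codes for title, abstract, keywords

expanded = expand_over_fields(ecstasy, fields)
print(expanded)
print(len(ecstasy), "terms became", len(ecstasy) * len(fields), "term occurrences")
```

Every synonym group is repeated once per field, so a conceptual query with long synonym lists quickly becomes hard to read and edit by hand. Generating the expanded form from the conceptual form avoids copy-and-paste errors.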

8.4 Wildcard characters

The query languages used by each interface have many advanced features that you can use to build powerful queries, and it is worthwhile diving into those. In addition to logical and other operators, another category of such features is wildcard characters. For example, the asterisk (*) can often be used to signify “zero or more alphabetic characters”, and the question mark (?) can often be used to signify “zero or one alphabetic character”. This allows you to rewrite this query fragment:

"behavior" OR "behaviors" OR "behavioral" OR "behaviour" OR "behaviours" OR "behavioural"

into this much shorter fragment:

"behavio?r*"

Because such operators differ per interface, it usually pays off to obtain a thorough understanding of the capabilities of each interface you will use before starting to craft your query (or while doing so), since you will want to create a query that’s as powerful and versatile as you can, but you will have to do this within the constraints of the query languages of the interfaces you’ll use.
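To check what a wildcard pattern would match before relying on it, you can emulate the wildcard semantics described above with a regular expression. This is a sketch under the assumption that `*` means “zero or more letters” and `?` means “zero or one letter”; the function `wildcard_to_regex` is made up for this example, and actual interfaces may define these symbols differently.

```python
import re


def wildcard_to_regex(pattern):
    """Translate * (zero or more letters) and ? (zero or one letter) into a regex."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append("[a-z]*")
        elif char == "?":
            parts.append("[a-z]?")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), flags=re.IGNORECASE)


regex = wildcard_to_regex("behavio?r*")
words = ["behavior", "behaviors", "behavioral",
         "behaviour", "behaviours", "behavioural"]

print(all(regex.fullmatch(word) for word in words))  # True
print(regex.fullmatch("behaviorism") is not None)  # True: wildcards can over-match
print(regex.fullmatch("behave") is not None)  # False
```

The second check shows the price of brevity: the shorter fragment also matches words that were not in the original list, so it pays to test a wildcard pattern against words you do not want as well as words you do.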

8.5 Team Consensus and Expert Consultation

It is important to achieve consensus with the team about the query before you finalize your preregistration and then run your query “for real”. If you miss important keywords, that can be a very expensive oversight to correct later on (depending on how smartly you designed your screening procedure; see below). For this reason, it is common to involve experts outside the research team to consult on the lists of synonyms and the logical structure of the query.

8.6 Databases versus Interfaces

Once you have formulated your conceptual query, you can start translating it into the languages that the interfaces of the databases you will use can understand. This language is generally specific to each interface. An interface is the application that performs the searches in the bibliographic databases for you and allows you to export the results in whichever format you want to use.

For example, PsycInfo is a bibliographic database maintained by the American Psychological Association. The APA keeps track of new articles that appear and adds them to PsycInfo. This database is accessible through various interfaces, and different institutions will have licenses with different interface providers. PsycInfo, for example, can be accessed through Ebsco, Ovid, and ProQuest. Ebsco, Ovid and ProQuest use different interface languages, so the syntax and operators you have to use to formulate your query will be different.

Those interfaces are often (but not always) maintained by different organisations than those maintaining the databases. Sometimes, a database maintainer offers its own interface: PubMed is a good example of this. However, usually an interface is developed by a different organisation that then provides access to multiple databases through their interface.

This has a number of benefits. One is that once you’re familiar with a given interface, you can use those skills to search multiple databases. For example, your institution may provide access to PsycInfo, MEDLINE, and PsycArticles through an Ebsco interface. It also allows you to search those databases simultaneously.

It also has a number of drawbacks. First, different interfaces work differently. The available operators, the symbols representing those operators, and the syntax you have to use to build a search query therefore differ per interface. If you want to search for a word, say “meta-analysis”, in an article title, sometimes you indicate this by saying "meta-analysis" IN TI, and sometimes by saying TI("meta-analysis").
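The two notations mentioned above can be produced from the same conceptual term by swapping out a small formatting function. In this sketch, the interface names `interface_a` and `interface_b` are hypothetical placeholders: which real interface uses which notation is something to look up in that interface’s documentation.

```python
def suffix_style(term, field):
    """Render a term as "term" IN FIELD."""
    return '"{}" IN {}'.format(term, field)


def prefix_style(term, field):
    """Render a term as FIELD("term")."""
    return '{}("{}")'.format(field, term)


# Hypothetical interface names mapped to the two syntax styles from the text.
FORMATTERS = {
    "interface_a": suffix_style,
    "interface_b": prefix_style,
}


def translate(term, field, interface):
    """Render one conceptual term/field pair in the syntax of `interface`."""
    return FORMATTERS[interface](term, field)


for name in FORMATTERS:
    print(name, "->", translate("meta-analysis", "TI", name))
```

Keeping one conceptual representation and one formatter per interface means that a change to the synonym lists only has to be made once.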

Second, the fields that exist differ per database. If you do search in multiple databases using the same interface, it is very important to clearly keep in mind which fields you search.

As a consequence of this heterogeneity in interface languages, once you have crafted your conceptual query, you have to translate it into each interface’s language. Depending on how many fields you want to search and on the features of each language, this can explode your query into quite unwieldy strings of characters. Make sure to document both the conceptual query and the final query you input into each database/interface combination.

Therefore, if you conduct a systematic review, it is important to always preregister both the database(s) you plan to use and the interface(s) you plan to use. In addition, it is important to document the search query you use in every interface/database combination.
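One possible way to keep this documentation in a machine-readable form is sketched below. The `SearchRecord` structure, its field names, and the extra `date_run` and `hits` fields are assumptions for this example, not a reporting standard, and all values shown are placeholders.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SearchRecord:
    """One executed search: which query ran where, when, and what it returned."""
    database: str
    interface: str
    query: str      # the query exactly as entered into this interface
    date_run: str   # ISO 8601 date
    hits: int


conceptual_query = '("determinant" OR "correlate") AND ("ecstasy" OR "XTC" OR "MDMA")'

# Placeholder values for illustration only; these are not real search results.
log = [
    SearchRecord(
        database="PsycInfo",
        interface="Ebsco",
        query="<query as typed into this interface>",
        date_run="2000-01-01",  # placeholder date
        hits=0,                 # placeholder count
    ),
]

document = {
    "conceptual_query": conceptual_query,
    "searches": [asdict(record) for record in log],
}
print(json.dumps(document, indent=2))
```

A file like this can be stored next to your preregistration and updated whenever you rerun a search.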

8.7 “Smart” searching

When conducting a systematic review, make sure to disable all “smart” searching features of the interface(s) you use. These features expand your query by including other synonyms. However, of course, this “smart” searching algorithm is in fact dumb: it cannot understand your goal(s) and/or research question(s), and so it will simply explode your query to find many more hits, the vast majority of which will be irrelevant to your goals/questions, because after all, you crafted a well-thought-through query.

A second problem of “smart” searching is that it is not replicable, since the algorithms implemented by these interface maintainers are adjusted over time. Since you cannot encode a “smart search version” parameter in your query specification, it’s not possible to solve this. As a consequence, using “smart” searching in effect renders your systematic review unsystematic: it can no longer be reproduced by other researchers — and worse, by yourself in the future.

Since systematic reviews typically take a year and often longer (see https://predicter.org/), you will often have to repeat your query towards the end, screening the additional hits, extracting entities from the additional inclusions, and re-running your analysis script. If your query was applied using “smart” searching, the results in this repeated query execution can change unpredictably.

Therefore, never use “smart” searching in the final query you will use (and freeze in your preregistration). You can use it while crafting your query, to find additional sources to include, inspect the titles and abstracts for search terms you may have missed (people use the weirdest synonyms at times), and improve your conceptual query accordingly.

8.8 Query validation

Usually, you’ll already have a few sources (e.g. articles, book chapters, etc.) that you know you will be including in your systematic review. While testing and perfecting your query, you’ll usually use these to check whether your query “works”: whether it finds the articles you know it’s supposed to find. If it fails to find one or more, then check whether it’s supposed to find it: all bibliographic databases have a limited scope, and so the source might simply not exist in that database (easily checked by entering its title as a query). If it was supposed to be included in the hits but wasn’t, then your query is missing one or more synonyms, so add those.

A quick way to check whether a given source is included in the hits of your query is by combining it with your query: basically create a “single use query” that combines the source’s title (or DOI, or ISBN, or any other unique identifier for the source) with your query using the AND operator.
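Such single use queries are easy to generate for a whole list of known sources. The sketch below uses made-up titles and a hypothetical helper `single_use_query`; it only builds the conceptual form, which you would still translate to the syntax of your interface.

```python
def single_use_query(query, identifier):
    """Combine the full query with one known source's identifier using AND."""
    return '({}) AND ("{}")'.format(query, identifier)


query = '("determinant" OR "correlate") AND ("ecstasy" OR "XTC" OR "MDMA")'

# Hypothetical titles of sources that should be found; replace with your own.
known_sources = [
    "Hypothetical title of a known study",
    "Another hypothetical title",
]

for title in known_sources:
    print(single_use_query(query, title))
```

If a single use query returns no hits, search for the title on its own: if that finds the source, the source is in the database and your query is what misses it.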

8.9 Exporting query hits

Once you have run your queries, you will need to download the hits, i.e. the identified bibliographic records. There is usually a set of formats available: a very common format that is generally well-supported is RIS (a format developed by “Research Information Systems”), and another good choice is BibTeX. Before deciding on the format, make sure you know how you want to conduct your screening. Ideally, you will be able to easily repeat your query later: when you revise the manuscript and want to make sure your findings aren’t outdated, when you have updated your query because you discovered a mistake, or, in the case of living reviews, when you want to update the review.
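To give an impression of what an RIS export looks like and how it can be read programmatically, here is a minimal sketch of a parser. It assumes the common layout in which every line holds a two-character tag, two spaces, a hyphen, and a value, and in which `ER` closes a record. It ignores continuation lines and other variations that real exports may contain, and the sample records are made up. For actual screening you will probably prefer an established reference manager or package.

```python
import re

# Each RIS line is a two-character tag, two spaces, a hyphen, and a value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  -\s?(.*)$")


def parse_ris(text):
    """Parse RIS text into a list of dicts mapping each tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # blank and continuation lines are skipped in this sketch
        tag, value = match.groups()
        if tag == "ER":  # end of record
            records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value.strip())
    return records


# Made-up records for illustration only.
sample = """TY  - JOUR
TI  - A hypothetical first title
AU  - Doe, J.
AU  - Roe, R.
ER  -

TY  - CHAP
TI  - A hypothetical second title
ER  -
"""

records = parse_ris(sample)
print(len(records))      # 2
print(records[0]["AU"])  # ['Doe, J.', 'Roe, R.']
```

Being able to read the export with a script makes it easier to rerun a query later and see which hits are new compared to the previous export.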