5  Planning the Extraction

5.1 Introduction to extraction

The extraction is the stage where the “data” are extracted from the identified sources. This means that the information from or about the included sources has to be stored in an extraction script file.

In a systematic review, the extraction and synthesis stages are the hardest (unless a meta-analysis is possible, in which case the extraction stage is the hardest). This is because the information you want to extract will often be ambiguous, and sometimes it will not be available at all. This ubiquitous ambiguity means that the task of extracting information is typically not a matter of copying information over: instead it’s more like playing detective.

5.2 Entities

Planning the extraction is also the hardest part of planning a systematic review. Planning the extraction means specifying the R extraction script, which requires specifying the entities to extract, how they are hierarchically organized, and which value templates each entity uses.

An entity is anything that has to be extracted from a source, such as the year something was published, its authors, a definition that was used in a source, a theory that was studied, a study design, a sample size, a measurement instrument, the literal text of measurement instrument items, expressions from interview participants, effect sizes, or things like the number of figures, tables, or words in a source.

During the planning phase, you decide which entities you want to extract and how. A garden-variety entity represents one type of thing you want to extract from your included sources. You will define them in the Rxs specification, a spreadsheet (see below for details).

5.2.1 Identifiers and titles

For every entity, you will choose an identifier, a title, and a description. The identifier is a unique machine-readable name for your entity. This will allow you to easily select all extracted data for a given entity during the synthesis phase. Identifiers can only contain lower- and uppercase Latin letters, Arabic numerals, and underscores, and must always start with a letter.¹ Titles are the human-readable equivalent: basically just the name of the entity. There are no constraints on which characters you can use, but titles should be short.
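The identifier rules can be checked mechanically. The following is a minimal sketch in Python, used here only for illustration; it is not how metabefor itself validates identifiers. It applies the regular expression given in the footnote to the whole string. The name `is_valid_identifier` is hypothetical.

```python
import re

# Letters, digits and underscores only; the first character must be a letter.
IDENTIFIER_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")

def is_valid_identifier(candidate: str) -> bool:
    """Return True if the whole string satisfies the identifier rules."""
    return IDENTIFIER_PATTERN.fullmatch(candidate) is not None

# Print each candidate with the result of the check.
for candidate in ["sampleSize", "varId1", "1stAuthor", "sample size", "_hidden"]:
    print(candidate, is_valid_identifier(candidate))
```

Only the first two candidates pass: the others start with a digit, contain a space, or start with an underscore.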

5.2.2 Descriptions and extraction instructions

Descriptions are longer, and should contain at least two, and preferably three elements. First, the description should describe and define the entity. Since what you include here will be all that extractors, other researchers and interested parties, and Future You will have to go on, it pays to make an effort to be as explicit as possible.

Second, the description should contain explicit extraction instructions for the extractors, even if you yourself are going to be the only extractor (see the section “Be explicit for redundancy, transparency, and future you”).

Third, ideally, the description should explicitly list one or more edge cases. Edge cases are examples of something that a source might contain where it is not obvious how it should be extracted correctly. By listing these and explicitly describing why that example should be extracted the way you specify, you help extractors (including Future You) understand better how you delineate your entity definition.

5.2.3 Values to be extracted

To specify which types of values can be extracted for each entity, value templates can be specified (see the Value Templates section below). For each entity, the valid values, default value, and examples specified in the value template can be overridden in the entity specification.
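The override logic can be pictured as a simple fallback: use the entity's own value if one is given, otherwise use the value from its value template. The sketch below is written in Python purely for illustration and is not metabefor's implementation. The field names follow the column names of the Rxs specification; the function name `resolve_entity_fields` and the example values are hypothetical, as is the assumption that an empty cell means "not overridden".

```python
def resolve_entity_fields(entity: dict, value_templates: dict) -> dict:
    """Combine an entity specification with its value template.

    For validValues, default and examples, a non-empty value in the entity
    specification wins; otherwise the value template's value is used.
    """
    template = value_templates[entity["valueTemplate"]]
    resolved = {}
    for field in ("validValues", "default", "examples"):
        override = entity.get(field)
        if override is None or override == "":
            resolved[field] = template.get(field)
        else:
            resolved[field] = override
    return resolved

# Hypothetical specification fragments, for illustration only.
value_templates = {
    "integer": {"validValues": None, "default": "NA", "examples": "30"},
}
entity = {
    "identifier": "sampleSize",
    "valueTemplate": "integer",
    "validValues": "",   # empty: fall back to the value template
    "default": None,     # missing: fall back to the value template
    "examples": "128",   # overrides the value template's example
}

print(resolve_entity_fields(entity, value_templates))
```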

5.2.4 Hierarchical structure and container entities

Because the number of entities extracted from the sources in a systematic review can become quite large, and because entities are often clustered together, entities have a hierarchical data structure. This means they form a tree: each source is a root that can have leaves and/or branches attached. In data tree terminology, terms from two vocabularies are mixed: tree terms, such as root, branch, and leaf; and family terms, such as parent, child, descendants, and ancestors.

To familiarize yourself with these terms, consider the example data tree below.

// Figure: An example data tree with entities (Graphviz DOT source).

digraph treeIllustration {

  graph [rankdir=TB];
  node [shape=box,fontname=arial];
  edge [arrowhead=none];
  
  source -> general;
  source -> methods;
  source -> results;
  
  general -> publicationYear;
  general -> sourceAuthors;
  general -> sourceTitle;
  
  methods -> sample;
  methods -> method;
  methods -> variables;
  
  sample -> sampleSize;
  sample -> samplingStrategy;
  
  variables -> variable;
  
  variable -> variableIdentifier;
  variable -> measurementLevel;
  
  results -> associations;
  associations -> association;
  
  association -> associationIdentifier;
  association -> varId1;
  association -> varId2;
  association -> r;
  association -> t;
  
}

In this tree, the source itself is the root to which all entities are attached. The three container entities attached to the root are general, methods, and results. These container entities are used to organise other entities: nothing is extracted for the container entities themselves, they just function to organise and represent their children. The children of the general container entity (publicationYear, sourceAuthors, and sourceTitle) are themselves leaves: they have no further descendants.

Of the entities specified as children of the methods container entity, the method entity doesn’t have descendants either: that entity is also a leaf. The sample entity does have two children (i.e. it is a branch), sampleSize and samplingStrategy, each leaves themselves (i.e. without children). The variables entity has one child that is itself a branch (variable), which has two children: variableIdentifier and measurementLevel (both leaves).

Finally, the results container entity contains one container entity called associations, which contains another container entity (i.e. a branch) called association, which contains five regular entities (i.e. leaves): associationIdentifier, varId1, varId2, r, and t.

The position of an entity or container entity in the Rxs tree is determined by its parent: for each entity, you specify the identifier of the container entity it falls under.
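Because every entity only names its parent, the whole tree can be reconstructed from those parent links. The sketch below does this for the example tree, in Python and purely as an illustration (it is not metabefor code). The function names are hypothetical, and representing "attached directly to the source" as `None` is an assumption made for this sketch.

```python
from collections import defaultdict

# Parent of each entity in the example tree. None marks entities attached
# directly to the source (an assumption made for this sketch).
PARENTS = {
    "general": None, "methods": None, "results": None,
    "publicationYear": "general", "sourceAuthors": "general",
    "sourceTitle": "general",
    "sample": "methods", "method": "methods", "variables": "methods",
    "sampleSize": "sample", "samplingStrategy": "sample",
    "variable": "variables",
    "variableIdentifier": "variable", "measurementLevel": "variable",
    "associations": "results", "association": "associations",
    "associationIdentifier": "association", "varId1": "association",
    "varId2": "association", "r": "association", "t": "association",
}

def children_of(parents):
    """Group entity identifiers by parent; the root is called 'source'."""
    children = defaultdict(list)
    for identifier, parent in parents.items():
        children[parent if parent is not None else "source"].append(identifier)
    return dict(children)

def leaves(parents):
    """Entities that are nobody's parent, i.e. that have no children."""
    used_as_parent = set(parents.values())
    return [identifier for identifier in parents if identifier not in used_as_parent]

def path_from_root(identifier, parents):
    """Ancestors of an entity, from the source down to the entity itself."""
    path = [identifier]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return ["source"] + path[::-1]

print(children_of(PARENTS)["methods"])        # the children of methods
print(path_from_root("sampleSize", PARENTS))  # source, methods, sample, sampleSize
print(len(leaves(PARENTS)))                   # number of leaf entities
```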

5.2.5 Repeating entities

Sometimes, an entity, container entity, or clustering entity (see the next section) can potentially be extracted multiple times. For example, a source may report on multiple samples or may report multiple effect sizes. Therefore, some entities are repeating entities, which means the extraction script will be set up such that the corresponding lines can be copy-pasted multiple times. Normally, only clustering entities will be repeating. This will be explained in more detail in the next section.

5.2.6 Clustering entities (‘lists’)

In addition to container entities, which themselves contain no extracted data but are used to organize other entities, there are clustering entities. You can consider clustering entities as a special type of container entity that only contains leaf entities that are closely related to each other. In the extraction script template, these clustered entities (i.e. the leaf entities in a clustering entity) are placed on successive lines, with their titles, descriptions, value template descriptions, and examples all concatenated in one line after the bit where the entity itself is extracted.

There are two benefits to using clustering entities. First, they are more efficient during extraction, especially if the clustering entity is a repeating entity (see the previous section). Second, metabefor has functions to supplement a clustering entity with entities from elsewhere in the extraction tree.

To illustrate this, look again at the example data tree above. In this Rxs tree, there are two repeating clustering entities: variable and association. The variable clustering entity contains two clustered entities: variableIdentifier (a unique identifier for each extracted variable) and measurementLevel (the measurement level of this variable). The association clustering entity contains five clustered entities: associationIdentifier (a unique identifier for each extracted association), varId1 (the identifier of the first variable, referring to a variable clustering entity by its variableIdentifier), varId2 (the identifier of the second variable, also referring to a variable clustering entity by its variableIdentifier), r (a Pearson correlation coefficient), and t (a Student t value).

Both the variable and association clustering entities are repeating entities. This means that they can each be extracted multiple times by copy-pasting the relevant lines in the extraction script. Because each clustering entity has a unique identifier, they can be referred to, and each association clustering entity refers to two variable clustering entities.

Now, imagine a systematic review on gym membership, exercise, diet, and BMI. The extractor might encounter a source where they extract four effect sizes in four association clustering entities:

  • the Pearson correlation between height and weight;
  • the Pearson correlation between weight and daily energy ingestion;
  • the Pearson correlation between weight and daily exercise; and
  • the Student t value for the association between gym membership and daily exercise.

The extractor also specifies the measurement level for each variable in five variable clustering entities.

During synthesis, metabefor allows supplementing the association clustering entities with the information specified in the variable clustering entities. It does this by taking the unique identifiers specified in varId1 and varId2 and looking up the variable clustering entities that have that identifier in their variableIdentifier entity.

This is a trivially simple example, but the functionality is powerful: it lets you extract efficiently and with high fidelity while retaining flexibility, because the extracted entities can easily be recombined during the synthesis stage to obtain data frames that lend themselves well to the intended synthesis.
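Conceptually, the supplementing step is a lookup by identifier. The sketch below mimics it for the gym membership example in Python. It only illustrates the idea and does not use or reproduce metabefor's functions. The function name `supplement_associations`, the variable identifiers, the measurement levels, and all numeric values are invented for this illustration.

```python
# Five variable clustering entities (measurement levels are illustrative).
VARIABLES = [
    {"variableIdentifier": "height", "measurementLevel": "continuous"},
    {"variableIdentifier": "weight", "measurementLevel": "continuous"},
    {"variableIdentifier": "dailyEnergyIngestion", "measurementLevel": "continuous"},
    {"variableIdentifier": "dailyExercise", "measurementLevel": "continuous"},
    {"variableIdentifier": "gymMembership", "measurementLevel": "dichotomous"},
]

# Four association clustering entities (effect sizes are invented values).
ASSOCIATIONS = [
    {"associationIdentifier": "a1", "varId1": "height", "varId2": "weight", "r": 0.5, "t": None},
    {"associationIdentifier": "a2", "varId1": "weight", "varId2": "dailyEnergyIngestion", "r": 0.3, "t": None},
    {"associationIdentifier": "a3", "varId1": "weight", "varId2": "dailyExercise", "r": -0.2, "t": None},
    {"associationIdentifier": "a4", "varId1": "gymMembership", "varId2": "dailyExercise", "r": None, "t": 2.1},
]

def supplement_associations(associations, variables):
    """Copy each referenced variable's measurement level into the association."""
    by_identifier = {v["variableIdentifier"]: v for v in variables}
    supplemented = []
    for association in associations:
        row = dict(association)  # leave the original record untouched
        for ref in ("varId1", "varId2"):
            variable = by_identifier[association[ref]]
            row[ref + "_measurementLevel"] = variable["measurementLevel"]
        supplemented.append(row)
    return supplemented

for row in supplement_associations(ASSOCIATIONS, VARIABLES):
    print(row["associationIdentifier"],
          row["varId1_measurementLevel"], row["varId2_measurementLevel"])
```

Each supplemented row now carries the measurement levels of both variables, which is the kind of flat record that is convenient to turn into a data frame for synthesis.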

5.3 Value Templates

Value templates are an efficient method to define a type of data to be extracted. The example metabefor Rxs specifications contain a number of common value templates:

  • numeric: Any valid number
  • numeric.multi: A vector of valid numbers
  • integer: Any valid whole number
  • integer.multi: A vector of integers (i.e. one or more whole numbers)
  • integer.length4.multi: A numeric vector of years
  • string: A single character value
  • string.multi: A character vector (i.e. one or more strings)
  • countrycode: A character vector of the ISO 3166-1 alpha-2 country code(s)
  • categorical: A string that has to exactly match one of the values specified in the “values” column of the Coding sheet
  • generalPresence: Whether the thing being coded was present or not.
  • string.mandatory: A single character value that cannot be omitted
  • string.entityRef.mandatory: A string that specifies another entity and which MUST be provided
  • string.entityRef.optional: A string that specifies another entity (can be missing, i.e. NA)
  • string.fieldRef.optional: A string that specifies another field in another entity (can be missing, i.e. NA).
  • matrix.crosstab: A table with frequencies; variable 1 in columns, variable 2 in rows; always work from absence/negative/less (left, top) to presence/positive/more (right, bottom)
  • string.identifier: A single character value that is used as an identifier and so is always mandatory and can only contain a-z, A-Z, 0-9, and underscores, and must start with a letter.

Each value template specifies a unique identifier, a description, optionally the valid values that can be extracted, a default value to insert into the extraction script template, one or more examples, an R expression to validate the extracted value (which implements the descriptions in the list above), and an error to show if that validation fails.
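To make the anatomy of a value template concrete, the sketch below models two of the templates listed above in Python. In metabefor the validation is an R expression stored in the spreadsheet; here a Python function stands in for it, so this illustrates the structure and is not the package's implementation. The class name `ValueTemplate`, the `check` method, and the error messages are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class ValueTemplate:
    identifier: str
    description: str
    validation: Callable[[Any], bool]  # stands in for the R validation expression
    error: str                         # message shown if validation fails

    def check(self, value: Any) -> Optional[str]:
        """Return None if the value is valid, otherwise the error message."""
        return None if self.validation(value) else self.error.format(value=value)

TEMPLATES = {
    "integer": ValueTemplate(
        identifier="integer",
        description="Any valid whole number",
        validation=lambda v: isinstance(v, int) and not isinstance(v, bool),
        error="Expected a whole number, but got {value!r}.",
    ),
    "string.identifier": ValueTemplate(
        identifier="string.identifier",
        description="A single character value that is used as an identifier",
        validation=lambda v: isinstance(v, str)
        and re.fullmatch(r"[a-zA-Z][a-zA-Z0-9_]*", v) is not None,
        error="{value!r} is not a valid identifier.",
    ),
}

print(TEMPLATES["integer"].check(128))                  # valid, so no message
print(TEMPLATES["integer"].check("many"))               # invalid, shows the error
print(TEMPLATES["string.identifier"].check("2ndWave"))  # starts with a digit
```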

5.4 Details of the Rxs specification

The entities are specified in a spreadsheet called an Rxs specification. Rxs stands for R Extraction Script: Rxs files are the machine-readable files that data from sources are extracted into. They are in fact R Markdown files that can be rendered as-is, but that can also be imported using metabefor. These files are created by metabefor based on the Rxs specification.

A very minimal example of such a spreadsheet is available at https://docs.google.com/spreadsheets/d/1Ty38BS7MVXOgC-GJ6zzr7E3rC_vQNOMKe-uCvIuHs3c. A more extensive example is available at https://docs.google.com/spreadsheets/d/13MUf8qL4Zmc5V6AOvjO1GWeFCl4IaaSl2zUT-Kk9tQc. See the example projects chapter in the Appendix for a list of example projects.

A spreadsheet holding an Rxs specification has at least the following worksheets:

  • entities: The specifications of the entities to be extracted in the systematic review.
  • valueTemplates: The value template specifications: an efficient way to specify ‘data types’ for entities.
  • definitions: Definitions of concepts used in the systematic review.
  • instructions: Instructions for the extractors.
  • texts: Texts to override metabefor’s default texts.

These will now briefly be described.

5.4.1 The entities worksheet

The entities worksheet has the following columns:

  • title: A short human-readable label for the entity (basically its name).
  • description: A longer human-readable description of the entity. Together with the value template description, this will form the instruction for the extractors, so make sure to clearly describe what they should look for in the sources.
  • identifier: A machine-readable identifier for this entity. This may only contain lower and upper case Latin letters (a-z and A-Z), underscores (_), and Arabic digits (0-9), and must start with a letter. This is used to refer to extracted entities in the results, or when cross-referencing entities (e.g. in the parent column).
  • valueTemplate: The identifier of the valueTemplate to use (see the valueTemplates worksheet).
  • validValues: Overrides the validValues specified in the specified valueTemplate.
  • default: Overrides the default value specified in the specified valueTemplate.
  • examples: Overrides the examples specified in the specified valueTemplate.
  • parent: The entity’s parent entity: in the hierarchical tree of extracted entities, the parent is the entity that this entity will fall under. For example, in the Rxs specification for the example tree shown above, the entities samplingStrategy and sampleSize each list sample in the parent column.
  • list: If list is set to TRUE, that designates this entity as a clustering entity. That means that the entities it contains are clustered entities that are presented in the extraction script in a list(). This allows for more efficient extraction of the child entities. However, it also means that in the tree of extracted entities, these child entities (i.e. the clustered entities) cannot themselves have child entities. In other words, those child entities are all leaves on the tree.
  • repeating: Set repeating to TRUE for entities that can be extracted multiple times. This is useful for, for example, effect sizes or other statistics, which can be extracted multiple times for a given source, but always have the same specifications.
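Several of these columns constrain each other, which makes an entities worksheet easy to sanity-check before use. The sketch below does so in Python for a small excerpt of the example tree, exported as CSV. It is an illustration only and not part of metabefor. The function name `check_entities` and the CSV rows are hypothetical, as is the assumption that container entities leave valueTemplate empty and that top-level entities leave parent empty.

```python
import csv
import io

# Hypothetical excerpt of an entities worksheet, exported as CSV.
ENTITIES_CSV = """title,identifier,valueTemplate,parent,list,repeating
Methods,methods,,,FALSE,FALSE
Sample,sample,,methods,FALSE,FALSE
Sample size,sampleSize,integer,sample,FALSE,FALSE
Sampling strategy,samplingStrategy,string,sample,FALSE,FALSE
Variables,variables,,methods,FALSE,FALSE
Variable,variable,,variables,TRUE,TRUE
Variable identifier,variableIdentifier,string.identifier,variable,FALSE,FALSE
Measurement level,measurementLevel,categorical,variable,FALSE,FALSE
"""

def check_entities(rows):
    """Return a list of human-readable problems found in the worksheet rows."""
    problems = []
    identifiers = [row["identifier"] for row in rows]
    known = set(identifiers)
    for identifier in sorted(known):
        if identifiers.count(identifier) > 1:
            problems.append(f"duplicate identifier: {identifier}")
    # Clustering entities have list set to TRUE; their children are clustered.
    clustering = {row["identifier"] for row in rows
                  if row["list"].strip().upper() == "TRUE"}
    clustered = {row["identifier"] for row in rows
                 if row["parent"].strip() in clustering}
    for row in rows:
        parent = row["parent"].strip()
        if parent and parent not in known:
            problems.append(f"{row['identifier']}: unknown parent {parent}")
        if parent in clustered:
            problems.append(
                f"{row['identifier']}: parent {parent} is a clustered entity "
                "and cannot have children")
    return problems

rows = list(csv.DictReader(io.StringIO(ENTITIES_CSV)))
print(check_entities(rows))  # an empty list means no problems were found
```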

These columns are also included, but contain functionality that is both quite advanced / abstract and not yet fully implemented in metabefor:

  • collapsing: To be added.
  • recurring: To be added.
  • recursing: Set recursing to TRUE for entities that can recurse, i.e. that can contain themselves.
  • identifying: Set to TRUE if this entity is an identifier.
  • entityRef: It is often useful to specify that extracted information relates to a specific entity (usually a container entity). In such cases, this column can be used to specify which entity is referred to. This is then used during validation to verify whether in the tree object, the value specified for this entity occurs as one of the values specified for the entityRef entities. For example, when conducting a meta-analysis, it is typically useful to extract the variables measured in a study as well as estimates for associations between those variables. Using entityRef entities, it is possible to extract the measurement instrument used for the relevant variables only once, and then refer back to those entities using the entityRef entity.
  • fieldRef: [ this is advanced functionality that still has to be implemented in metabefor ] Sometimes, extracted information does not relate to another entity, but to one specific value for an entity specified in the entityRef. The fieldRef field allows you to specify the identifier of the entity within the entity referenced in the entityRef entity to which the parent entity pertains.
  • owner: This entity’s owner. Specifying an owner signifies that all entities with that identifier must contain at least one entity with the current identifier.
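The validation described for entityRef boils down to checking that every extracted reference occurs among the values extracted for the referenced entity. A minimal sketch of that check in Python follows. It is an illustration only, not metabefor's validation code; the function name `unresolved_references` and the example values are hypothetical.

```python
def unresolved_references(references, targets):
    """Return the references that do not occur among the target values."""
    known = set(targets)
    return [ref for ref in references if ref not in known]

# Identifiers extracted for the variables, and the references made to them
# from the associations (values invented for this illustration).
variable_identifiers = ["height", "weight", "dailyExercise"]
referenced_in_associations = ["height", "weight", "weight", "gymMembership"]

print(unresolved_references(referenced_in_associations, variable_identifiers))
# ['gymMembership']
```

Here gymMembership is reported because an association refers to it, but no variable with that identifier was extracted.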

5.4.2 The valueTemplates worksheet

  • identifier: The unique identifier of this value template, used in the entities worksheet to specify that this value template should be applied to an entity. This must be a machine-readable identifier, and so may only contain lower and upper case Latin letters (a-z and A-Z), underscores (_), and Arabic digits (0-9), and must start with a letter.
  • description: A description of this type of value. This will be shown in the Rxs template for every entity that this value template will be applied to. Specifically, extractors will see this description printed below those entities.
  • validValues: Optionally, a list of valid values. Each value must be separated by double pipes (||). For example: "Unknown" || "Present" || "Absent" means that one or more of those three strings must be extracted.
  • default: The default value inserted in the Rxs template.
  • examples: Examples of extracted values.
  • validation: An R expression to validate the extracted entity.
  • error: An error message to show if the validation fails.
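The double-pipe convention for validValues is easy to parse. The sketch below shows one way to split such a cell into separate values in Python. It is illustrative only and not how metabefor reads the worksheet. The function name `parse_valid_values` is hypothetical, and stripping the surrounding double quotes is an assumption of this sketch.

```python
def parse_valid_values(cell: str) -> list:
    """Split a validValues cell on double pipes and tidy up each value."""
    values = []
    for part in cell.split("||"):
        part = part.strip()
        # Drop one pair of surrounding double quotes, if present.
        if len(part) >= 2 and part[0] == '"' and part[-1] == '"':
            part = part[1:-1]
        if part:
            values.append(part)
    return values

print(parse_valid_values('"Unknown" || "Present" || "Absent"'))
# ['Unknown', 'Present', 'Absent']
```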

5.4.3 The definitions worksheet

Here, you can specify definitions that are important in your project. They will be included in the extractor instructions, together with the contents of the instructions worksheet. There are two columns:

  • term: A term.
  • definitions: The corresponding definition.

5.4.4 The instructions worksheet

Here, you can specify instructions for your coders. There are two columns:

  • heading: A heading, which will be included as a heading in the instructions.
  • definitions: The instructions that should be displayed below that heading.

5.4.5 The texts worksheet

This functionality has not been implemented yet, but it will allow overriding the default texts produced by metabefor.

  • textIdentifier: A unique identifier for the text fragment.
  • content: The text fragment that should be used.

5.5 Post-hoc entity specification: Txs specifications

Tabulated Extraction Sheet specifications


  1. If you’re already familiar with regular expressions, the regular expression is [a-zA-Z][a-zA-Z0-9_]*. If you’re not already familiar with regular expressions: they’re an extremely powerful tool to describe, search for, and replace text fragments, well worth at least a brief acquaintance.