12 Executing the Extraction

12.1 Data management

12.2 Extracting entities

12.3 The structure of an Rxs file

Extraction occurs in Rxs files. Rxs files have the extension “.rxs.Rmd” (the last part, “.Rmd”, is the extension of R Markdown files, because Rxs files are also R Markdown files). These are plain text files that you can edit with any text editor, such as Notepad, Notepad++, or TextEdit.

However, it is best to use RStudio, because then you can easily validate the values you extracted while you still have everything fresh in your mind.
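If you want to see which files in a folder are Rxs files, you can match on the double extension. The following is a minimal sketch in base R; the file names are made up for illustration.

```r
# Hypothetical file names, for illustration only
fileNames <- c("g5fj.rxs.Rmd", "notes.Rmd", "qurid_7h4pksl6.rxs.Rmd", "readme.txt")

# Rxs files end in ".rxs.Rmd"; the dots are escaped because in a
# regular expression, an unescaped dot matches any character
isRxsFile <- grepl("\\.rxs\\.Rmd$", fileNames)

# Keep only the Rxs files
rxsFiles <- fileNames[isRxsFile]
print(rxsFiles)
```

Note that “notes.Rmd” is not selected: it is an R Markdown file, but not an Rxs file.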

12.3.1 The bits you can ignore

The Rxs file is structured in four sections. During extraction, only the second section matters: you can (and should) ignore the other sections. This second section starts on line 4, with a comment that, by default, is the following:

<!--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~-->
<!--                                                                       -->
<!-- Welcome to the R Extraction Script (.rxs.Rmd file) for this source!   -->
<!--                                                                       -->
<!-- You can now start extracting. If you haven't yet studied the          -->
<!-- extractor instructions, please do so first. If you're all set, good   -->
<!-- luck!                                                                 -->
<!--                                                                       -->
<!--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~-->

The text in this box can be customized, so you might see something else instead. This second section ends with a similar comment, which by default has the following text:

<!--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~-->
<!--                                                                       -->
<!-- Well done! You are now done extracting this source. Great job!!!      -->
<!--                                                                       -->
<!-- Now, please knit the R Extraction Script into an HTML file and        -->
<!-- carefully check whether you entered everything correctly, since it    -->
<!-- will cost much less time to correct any errors, now that you still    -->
<!-- have this source in your mind, than later on when you'll have to      -->
<!-- dive into it all over.                                                -->
<!--                                                                       -->
<!--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~-->

In between these two comments, you extract the values for the specified entities. Consequently, you can ignore everything above the first comment and below the second comment.

12.3.2 The sourceId block

After the first comment, the first thing you have to specify is the unique source identifier (the sourceId) for the source (e.g. an article, book, report, or other source of entities you will extract) you are extracting.

The block for the source identifier looks like this:

##############################################################################
############################################ START: uniqueSourceIdentifier ###
##############################################################################
uniqueSourceIdentifier <- 
##############################################################################
### 
### SET UNIQUE SOURCE IDENTIFIER
### 
### A unique identifier used in this systematic review to refer to this
### source
### 
##############################################################################
    
    ""
    
##############################################################################
########################################### VALUE DESCRIPTION AND EXAMPLES ###
##############################################################################
### 
### A unique identifier to use in this systematic review. For sources
### with a DOI, this is the last part of the shortDOI as looked up
### through https://shortdoi.org (the part after the "10/"). For sources
### without a DOI (and so, without a shortDOI), this can be, for example,
### the QURID (Quasi-Unique Record Identifier) that was designated during
### the screening phase or which you can create with
### `metabefor::qurid()`.
### 
### EXAMPLES:
### 
### "g5fj"
### "qurid_7h4pksl6"
### 
##############################################################################
############################################## END: uniqueSourceIdentifier ###
##############################################################################

As you see, there are a lot of hashes (or ‘pound signs’: the # symbol) to structure this block visually.

We will now walk through this block and inspect the bits you need to pay attention to.

12.3.2.1 The block start marker

The block starts with the block start marker. This marker indicates where the block starts and specifies to which entity the block belongs:

############################################ START: uniqueSourceIdentifier ###

Here, you see that this entity has unique identifier “uniqueSourceIdentifier”, which seems fitting for extracting the unique source identifiers.

12.3.2.2 The entity description sub-block

You can then ignore everything up until the sub-block with the entity’s label and description:

##############################################################################
### 
### SET UNIQUE SOURCE IDENTIFIER
### 
### A unique identifier used in this systematic review to refer to this
### source
### 
##############################################################################

The entity’s label is shown fully capitalized, followed by this entity’s description.

12.3.2.3 The block core: where you specify the extracted value

You then see a blank line, two indented double quotes, and another blank line:

""

This is the core of this block. It is where the entity value that you extract is specified (in between the double quotes). The double quotes are the default value for this entity. In this case, they represent an empty text string.

Note

Note that for computers, literal text strings always have to be in between either a pair of double quotes ("like this") or a pair of single quotes ('like this'). Double quotes usually make the most sense, because single quotes also serve as apostrophes, so when you copy-paste a text, especially in English, it’s likely to contain one or more single quotes (e.g. in the word “it's”). If that text string is then delimited by single quotes, you will cause an error, because as far as the computer knows, you unintentionally stop the text string specification as soon as the first “'” in the text string is encountered.
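You can see this for yourself with a small base R sketch. It stores two pieces of R code as text and asks R whether it can parse them; the helper function `canBeParsed()` is made up for this illustration.

```r
# The same text, delimited once by double quotes and once by single quotes.
# Each object holds a piece of R code, stored as text so we can try to parse it.
withDoubleQuotes <- "\"it's a sample\""
withSingleQuotes <- "'it's a sample'"

# Hypothetical helper: returns TRUE if R can parse the code, FALSE otherwise
canBeParsed <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

canBeParsed(withDoubleQuotes)  # the apostrophe sits safely inside double quotes
canBeParsed(withSingleQuotes)  # the apostrophe ends the text string too early
```

The first call returns TRUE and the second returns FALSE, which is the error described in this note.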

12.3.2.4 The value description sub-block

This position, where you specify the value for this entity, is followed by another sub-block:

##############################################################################
########################################### VALUE DESCRIPTION AND EXAMPLES ###
##############################################################################
### 
### A unique identifier to use in this systematic review. For sources
### with a DOI, this is the last part of the shortDOI as looked up
### through https://shortdoi.org (the part after the "10/"). For sources
### without a DOI (and so, without a shortDOI), this can be, for example,
### the QURID (Quasi-Unique Record Identifier) that was designated during
### the screening phase or which you can create with
### `metabefor::qurid()`.
### 
### EXAMPLES:
### 
### "g5fj"
### "qurid_7h4pksl6"
### 
##############################################################################

This sub-block is marked “VALUE DESCRIPTION AND EXAMPLES”. Unlike the entity label and description that we saw above, this sub-block tells you what kind of value has to be extracted for this entity. So, whereas the first sub-block tells you what you need to specify here (e.g. what to look for in the PDF of a source), this second sub-block tells you how you need to specify what you extracted.

For example, some entities always have to be numbers (e.g., number of participants in a study, or a correlation coefficient). Some numbers always have to be whole numbers, without decimals (e.g., number of participants in a study), whereas others can contain decimals (e.g., a correlation coefficient). Sometimes you need to extract a text string. Sometimes you can extract a so-called vector containing multiple text strings or numbers.

This description below the block core (where you specify the value you extracted for this entity) tells you how to specify that value given what was decided during the planning of this systematic review.

Note

You specify a vector using “c()”. For example, to extract two numbers, say 1 and 2, you would specify “c(1, 2)”. This “c()” stands for “combine”, because it lets you combine two or more values into one vector.
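As a brief illustration in base R (the object names are only there for this example):

```r
# Two numbers combined into one numeric vector
twoNumbers <- c(1, 2)

# Two text strings combined into one character vector
twoStrings <- c("First value", "Second value")

length(twoNumbers)        # the number of elements in the vector
is.numeric(twoNumbers)    # does the vector hold numbers?
is.character(twoStrings)  # does the vector hold text strings?

# A single value is simply a vector with one element
length("Example")
```

In an Rxs file you would not store the vector in an object: you would type only the “c(...)” part in the block core.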

12.3.2.5 The block end marker

Finally, this block ends with the block end marker:

##############################################################################
############################################## END: uniqueSourceIdentifier ###
##############################################################################

These are basically the exact same lines as the block start marker, except that the entity identifier for this block is now preceded by the word “END”, instead of by the word “START”.

This block end marker closes the block for this entity. In this case, this means the block for the source identifier is done.

12.3.2.6 The value to specify here

The value you specify as “uniqueSourceIdentifier” is described in the value description sub-block (the sub-block marked with “VALUE DESCRIPTION AND EXAMPLES”).

If you don’t have it yet, you’ll have to get the ShortDOI for this source, assuming the source has a DOI. A ShortDOI is a brief unique identifier for any object that has a Digital Object Identifier (i.e. a DOI). You can find the ShortDOI that corresponds to a given DOI at https://shortdoi.org.

If you’re extracting a source that doesn’t have a DOI, normally, during screening every screened entry will have received a QURID, a Quasi-Unique Record Identifier. You can then copy that from the extraction spreadsheet, which is where you probably also have to copy-paste the ShortDOI if that’s what you extracted as the value for “uniqueSourceIdentifier” (refer to your extraction instructions for the details).

When you’re done, the block core should look something like this:

    
    "gqw5jr"

Note that you shouldn’t include the full ShortDOI, but instead omit the “10/” that it starts with, so you only specify letters and digits.
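If you copied the full ShortDOI, removing the “10/” is a one-line operation. This is a minimal sketch in base R, assuming the ShortDOI that corresponds to the example value is “10/gqw5jr”; normally you would just delete those three characters by hand.

```r
# Full ShortDOI as copied from https://shortdoi.org (assumed example value)
shortDoi <- "10/gqw5jr"

# Remove the leading "10/" so that only letters and digits remain
uniqueSourceIdentifier <- sub("^10/", "", shortDoi)

print(uniqueSourceIdentifier)

# Check that the result consists of letters and digits only
grepl("^[a-zA-Z0-9]+$", uniqueSourceIdentifier)
```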

12.3.3 The extractorId block

Then, you move on to the second block: the extractorId block. This extractor identifier is the identifier for you, so that later on in the project, it’s still clear who extracted this source.

With your team, you will agree on identifiers for every extractor. Note that because the Rxs files will be made public, these extractor identifiers will also become public, so you may want to avoid using your name (on the other hand, this could also be a reason to deliberately use your name: just think about what you would prefer).

The extractor identifier block looks like this:

##############################################################################
############################################### START: extractorIdentifier ###
##############################################################################
extractorIdentifier <- 
##############################################################################
### 
### SPECIFY YOUR EXTRACTOR IDENTIFIER
### 
### An identifier unique to every extractor
### 
##############################################################################
    
    ""
    
##############################################################################
########################################### VALUE DESCRIPTION AND EXAMPLES ###
##############################################################################
### 
### Identifiers can only consist of (lower or uppercase) Latin letters
### [a-zA-Z], Arabic numerals [0-9], and underscores [_], and always have
### to start with a letter.
### 
### EXAMPLES:
### 
### "extractor_1"
### "Alex"
### 
##############################################################################
################################################# END: extractorIdentifier ###
##############################################################################

You should now recognize the block start marker (the bit saying “START: extractorIdentifier”) and the block end marker (the bit saying “END: extractorIdentifier”), as well as the entity description sub-block (with label “SPECIFY YOUR EXTRACTOR IDENTIFIER” and description “An identifier unique to every extractor”), the block core (the two double quotes, “""”), and the value description sub-block (marked with “VALUE DESCRIPTION AND EXAMPLES” and then proceeding to describe the constraints that an identifier must satisfy).
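The constraints in this value description can be written as a regular expression. The sketch below uses base R to check a few candidate identifiers against those rules; the candidates other than “extractor_1” and “Alex” are made up for illustration.

```r
# The rules from the value description, as a regular expression:
# start with a letter, then only letters, digits, and underscores
identifierPattern <- "^[a-zA-Z][a-zA-Z0-9_]*$"

# Two examples from the template, plus three invalid candidates
candidates <- c("extractor_1", "Alex", "1st_extractor", "my id", "")

# TRUE for every candidate that follows the rules
isValid <- grepl(identifierPattern, candidates)

data.frame(candidate = candidates, valid = isValid)
```

The last three candidates are rejected: one starts with a digit, one contains a space, and one is empty.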

In this case, you type or copy-paste your personal extractor identifier in between the double quotes in the block core. When you’re done, the block core should look something like this:

    
    "myExtractorId"

12.3.4 Starting the actual extraction

Now that you’ve specified the relevant metadata for this extraction (so far, you haven’t actually extracted anything from the source), you can start with the actual entities to extract.

This part starts with the following lines:

##############################################################################
##################################################### START: source (ROOT) ###
##############################################################################
rxsObject <- data.tree::Node$new('source');
currentEntity <- rxsObject;
##############################################################################

You can ignore this: it just shows when the metadata (the source identifier and extractor identifier) end and the real entities to extract start.

This start also has a corresponding block all the way at the bottom, just before the closing comment:

##############################################################################
####################################################### END: source (ROOT) ###
##############################################################################
class(rxsObject) <- c('rxs', 'rxsObject', class(rxsObject));
rxsObject$rxsMetadata <- list(rxsVersion='0.3.0', moduleId=NULL, id=uniqueSourceIdentifier, extractorId=extractorIdentifier);
##############################################################################
##############################################################################
##############################################################################

In between these two blocks, you specify the values for the actual entities.

In principle, this process is relatively simple: you just scroll further down in the Rxs file, and for every entity block, you specify in the block core whatever is explained for that entity and the required value.

There are three more patterns that you’ll likely encounter, so let’s look at those first.

12.3.5 Entity containers

Entities are usually organized hierarchically (in a nested, tree-like structure). This is useful because the point of using Rxs files is that it’s easy to combine extracted entities from multiple files for the same source. This way, many people can easily collaborate on the same database of machine-readable literature. You can even collaborate with “future you”: it’s easy to first do a relatively superficial extraction pass, and later on specify more detailed entities in another Rxs specification, extract those, and then combine all data in one database.

Entity containers are entities that are themselves not extractable, but that just exist to contain other entities. Common entity containers are, for example, “General”, “Methods”, and “Results”. Every time you encounter a container entity, the blocks of hashes indent by two spaces.

This indenting starts immediately after the “source root” container has opened:

##############################################################################
##################################################### START: source (ROOT) ###
##############################################################################
rxsObject <- data.tree::Node$new('source');
currentEntity <- rxsObject;
##############################################################################

  ############################################################################
  ######################################################### START: general ###
  ############################################################################
  currentEntity <- currentEntity$AddChild('general');
  ############################################################################
  ### 
  ### GENERAL
  ### 
  ### General information about the article
  ### 
  ############################################################################

In this example, the source root contains an entity container with entity identifier “general”. This entity container has label “GENERAL” and description “General information about the article”.

This entity container does not itself contain a corresponding value: instead, the Rxs file indents again and an entity block (for the entity with identifier “qurid”) is shown:

    ##########################################################################
    ######################################################### START: qurid ###
    ##########################################################################
    currentEntity <- currentEntity$AddChild('qurid');
    currentEntity[['value']] <-
    ##########################################################################
    ### 
    ### QURID
    ### 
    ### Quasi-Unique Record Identifier. We will use this to
    ### supplement this information with information from the
    ### bibliographic databases (i.e. the screening database).
    ### 
    ##########################################################################
        
        NA
        
    ##########################################################################
    ####################################### VALUE DESCRIPTION AND EXAMPLES ###
    ##########################################################################
    ### 
    ### A single character value
    ### 
    ### EXAMPLES:
    ### 
    ### "Example"
    ### "Another example"
    ### 
    ##########################################################################
    currentEntity[['validation']] <- expression(is.na(VALUE) || (is.character(VALUE) && length(VALUE) == 1));
    currentEntity <- currentEntity$parent;
    ##########################################################################
    ########################################################### END: qurid ###
    ##########################################################################

Further down, this entity container ends with a block end marker for this entity container (which had “general” as entity identifier):

  ############################################################################
  ############################################################################
  currentEntity <- currentEntity$parent;
  ############################################################################
  ########################################################### END: general ###
  ############################################################################
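To get a feeling for what these lines do, here is a minimal sketch that replays the same steps outside an Rxs file. It assumes the {data.tree} package is installed (this is the package the template itself calls); the value for “qurid” is made up.

```r
library(data.tree)  # the package that the Rxs template itself uses

# Same steps as in the template: create the root and point to it
rxsObject <- Node$new("source")
currentEntity <- rxsObject

currentEntity <- currentEntity$AddChild("general")  # open the container
currentEntity <- currentEntity$AddChild("qurid")    # open the entity
currentEntity[["value"]] <- "7h4pksl6"              # hypothetical value
currentEntity <- currentEntity$parent               # close the entity
currentEntity <- currentEntity$parent               # close the container

# We are back at the root, and the tree holds the nested entity
currentEntity$name
print(rxsObject, "value")
```

Every “AddChild” moves one level deeper into the tree, and every “parent” moves one level back up. This is why each block that opens an entity also has a line that returns to the parent before the block end marker.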

12.3.6 Clustering entities

Often, you will want to extract multiple closely related values; or you want to extract something that can have many different forms in a source, for example an effect size, where you need to know what you’re extracting exactly. In those situations, you usually use “clustering entities” or “list entities”.

They look like an entity where you don’t extract just one value, but multiple labelled values. An example is shown below:

  ############################################################################
  ############################################# START: authors (REPEATING) ###
  ############################################################################
  currentEntity <- currentEntity$AddChild('authors__1__');
  currentEntity[['value']] <-
  ############################################################################
  ### 
  ### AUTHORS
  ### 
  ### Information about each author. Note that authors are repeating;
  ### therefore, copy the list below multiple times if there are
  ### multiple authors. Fill it out in the order of authorship.
  ### 
  ############################################################################
      
      list(authorId = NA,            ### Author identifier: A unique identifier for this author in this source; most likely, this author's last name suffices. If multiple authors share a last name, number them (e.g. "smith1", "smith2", etc). [Examples: "example1"; "example_2"] [Value description: A single character value that must start with a letter and can only contain alphanumeric characters and underscores]
           authorName = NA,          ### Author name: The full name of this author. [Examples: "Example"; "Another example"] [Value description: A single character value]
           authorORCID = NA,         ### Author ORCID: The author's ORCID, if available or findable. If not, enter "nr". [Examples: "Example"; "Another example"] [Value description: A single character value]
           authorAffiliation = NA);  ### Author affiliation: The author's affiliations as a vector of strings. Each element should be one affiliation as listed on the article. Affiliations are ideally specified as RORs with the format "https://ror.org/019wvm592", where the part after the last slash is the ROR, but the first part of the URL is included, as well (search for an affiliation's ROR using https://ror.org/search. If no ROR can be found, type in the author institurion manually. If no author institution can be found, specify "nr". [Examples: c("First value", "Second value")] [Value description: A character vector (i.e. one or more strings)]
      
  ############################################################################
  currentEntity[['validation']] <- list(`authorName` = expression(is.na(VALUE) || (is.character(VALUE) && length(VALUE) == 1)),
                                        `authorORCID` = expression(is.na(VALUE) || (is.character(VALUE) && length(VALUE) == 1)),
                                        `authorAffiliation` = expression(is.na(VALUE) || (is.character(VALUE))));
  currentEntity$name <- metabefor::nodeName(currentEntity$value[[1]], "authors__1__");
  currentEntity <- currentEntity$parent;
  ############################################################################
  ############################################### END: authors (REPEATING) ###
  ############################################################################

Let’s take a closer look at the block core in this entity block:

      list(authorId = NA,            ### Author identifier: A unique identifier for this author in this source; most likely, this author's last name suffices. If multiple authors share a last name, number them (e.g. "smith1", "smith2", etc). [Examples: "example1"; "example_2"] [Value description: A single character value that must start with a letter and can only contain alphanumeric characters and underscores]
           authorName = NA,          ### Author name: The full name of this author. [Examples: "Example"; "Another example"] [Value description: A single character value]
           authorORCID = NA,         ### Author ORCID: The author's ORCID, if available or findable. If not, enter "nr". [Examples: "Example"; "Another example"] [Value description: A single character value]
           authorAffiliation = NA);  ### Author affiliation: The author's affiliations as a vector of strings. Each element should be one affiliation as listed on the article. Affiliations are ideally specified as RORs with the format "https://ror.org/019wvm592", where the part after the last slash is the ROR, but the first part of the URL is included, as well (search for an affiliation's ROR using https://ror.org/search. If no ROR can be found, type in the author institurion manually. If no author institution can be found, specify "nr". [Examples: c("First value", "Second value")] [Value description: A character vector (i.e. one or more strings)]

We see that this block core consists of “list()”. In between those parentheses, there is a list of four entity identifiers, each followed by an equals sign and “NA”. Then, after some spaces and three hashes (###), come the entity’s label and description and, between square brackets (“[” and “]”), examples and the value description.

The default value for these four entities is “NA”, which in R stands for “Not Available” (but you can read it as “Not Extracted Yet”). When you extract these four entities, you still have to supply the double quotes yourself. Once you have extracted this clustering entity (or list entity), the block core might look like this:

      list(authorId = "viechtbauer",            ### Author identifier: A unique identifier for this author in this source; most likely, this author's last name suffices. If multiple authors share a last name, number them (e.g. "smith1", "smith2", etc). [Examples: "example1"; "example_2"] [Value description: A single character value that must start with a letter and can only contain alphanumeric characters and underscores]
           authorName = "Wolfgang Viechtbauer",          ### Author name: The full name of this author. [Examples: "Example"; "Another example"] [Value description: A single character value]
           authorORCID = "0000-0003-3463-4063",         ### Author ORCID: The author's ORCID, if available or findable. If not, enter "nr". [Examples: "Example"; "Another example"] [Value description: A single character value]
           authorAffiliation = "https://ror.org/02jz4aj89");  ### Author affiliation: The author's affiliations as a vector of strings. Each element should be one affiliation as listed on the article. Affiliations are ideally specified as RORs with the format "https://ror.org/019wvm592", where the part after the last slash is the ROR, but the first part of the URL is included, as well (search for an affiliation's ROR using https://ror.org/search [Examples: c("First value", "Second value")] [Value description: A character vector (i.e. one or more strings)]

Clustering entities (or list entities, whatever you prefer to call them) are in the end just convenient ways to more quickly extract closely related entities.
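In R terms, the block core of a clustering entity is a named list. The base R sketch below shows how such a list behaves; the first three values come from the example above, and the affiliations are placeholders. In the template, the first element of the list (`value[[1]]`) is what gets passed to `metabefor::nodeName()`, which is why the first entity serves as the identifier.

```r
# A filled-in list entity; the affiliations are placeholder values
author <- list(authorId = "viechtbauer",
               authorName = "Wolfgang Viechtbauer",
               authorORCID = "0000-0003-3463-4063",
               authorAffiliation = c("First value", "Second value"))

names(author)                     # the entity identifiers in this list
author$authorName                 # access one value by its identifier
author[[1]]                       # the first value, here the author identifier
length(author$authorAffiliation)  # one entity can hold a vector of values
```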

12.3.7 Repeating entities

The final common pattern you may encounter is repeating entities. Repeating entities are used in situations where during the planning of the systematic review, you cannot be sure how often a given entity will occur in a source. This can be the case, for example, with samples in a study; or with authors; or with countries where data were collected; or many other entities.

The solution is relatively simple: you just copy-paste the entity block for a repeating entity. If you look back, you saw that the entity block for entity “authors” was repeating. You can see this by inspecting the block start marker and the block end marker.

The block start marker was:

  ############################################################################
  ############################################# START: authors (REPEATING) ###
  ############################################################################

The block end marker was:

  ############################################################################
  ############################################### END: authors (REPEATING) ###
  ############################################################################

The text “(REPEATING)” after the entity identifier tells you that this is a repeating entity.

Therefore, if this source has multiple authors (and most sources do), you copy the entire block (including the block start marker and the block end marker) and you paste it right below (with an empty line in between so you can easily spot where one entity block ends and the next one starts).

If you forget to specify a valid identifier for the first entity in a repeating list entity, {metabefor} will throw an error when you try to “knit” or “render” the Rxs file (using CTRL-SHIFT-K).

For example, in this case it would say something like:

Quitting from lines 15-883 [rxs-extraction-chunk] (extractionScriptTemplate.rxs.Rmd)
                                                                                                                                    
Error in `metabefor::nodeName()`:
! 
---------- metabefor error, please read carefully:

As an identifier for this entity (with temporary name
'authors__1__'), you specified `NA` (you probably forgot to
specify an identifier). Please change it to a valid entity
identifier!

Identifiers can only consist of (lower or uppercase) Latin
letters [a-zA-Z], Arabic numerals [0-9], and underscores
[_], and always have to start with a letter (as a regular
expression: ^[a-zA-Z][a-zA-Z0-9_]*$).

----------

Backtrace:
 1. metabefor::nodeName(currentEntity$value[[1]], "authors__1__")
Execution halted

This error shows up in the R console in RStudio, by default located in the bottom-left corner of RStudio.

In this case, fix that identifier and try again.

You then specify the values for the second occurrence of this entity (in this case, for the second author). This way, you are able to extract as many repetitions of this entity as the source may contain.
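When you paste a block several times, it is easy to leave an identifier at “NA” or to reuse the same identifier twice. The following base R sketch shows one way to spot both problems in a set of identifiers. The identifiers are made up, and this is only an illustration: it is not how {metabefor} itself performs its checks.

```r
# Hypothetical identifiers from three copies of a repeating entity:
# the second was left at its default, the third repeats the first
authorIds <- c("smith1", NA, "smith1")

# Positions of the copies where no identifier was specified yet
missingIds <- which(is.na(authorIds))

# Identifiers that occur more than once
duplicatedIds <- unique(authorIds[duplicated(authorIds)])

missingIds
duplicatedIds
```

Here, the second copy still needs an identifier, and “smith1” has to be changed in one of the two copies that use it (for example, to “smith2”).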

If you’re done, hit CTRL-SHIFT-K again to “knit” or “render” the Rxs file. If everything goes well, you should see something similar to what you see at https://metabefor.opens.science/articles/validation.html.

In that case, check whether every entity validates successfully. If so, check the table at the bottom and make sure that all values as they show up there are correct. These are the values as they are imported, so if they show up correctly here, you know that everything went well. If not, correct whatever is not going well.
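To understand what validation means here, you can evaluate a validation expression yourself. The sketch below copies the expression from the “qurid” entity block shown earlier and evaluates it in base R for a few values. The helper `checkValue()` is made up for this illustration, and whether {metabefor} evaluates the expressions in exactly this way is an assumption.

```r
# Validation expression copied from the "qurid" entity block
validation <- expression(is.na(VALUE) || (is.character(VALUE) && length(VALUE) == 1))

# Hypothetical helper: evaluate the expression with VALUE set to a candidate
checkValue <- function(value) {
  eval(validation[[1]], list(VALUE = value))
}

checkValue(NA)          # the default value: nothing extracted yet
checkValue("7h4pksl6")  # a single text string
checkValue(42)          # a number, where a text string is expected
```

The first two calls return TRUE and the third returns FALSE: a number does not satisfy “A single character value”.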

Congratulations - you extracted a source, made a little bit of the literature machine-readable, and so contributed to scientific progress and a better world! 👍🙏