In this notebook I reproduce all the calculations, tables and figures from the paper “Human Rights and the Grammar of Corporate Social Responsibility.” The notebook does not reproduce the paper as a whole, but it should allow the paper’s readers to understand how the corpus was manipulated and how the results were produced.

Libraries

R is designed around a large ecosystem of libraries, each built for a specific purpose. This paper relies heavily on libraries in the Tidyverse, which are designed to promote a particular approach to programming in R.1 I also rely on the Quanteda and Tidytext libraries, whose authors have been at the fore in developing R’s potential for text mining.2

CollocateR

The CollocateR library was written to support the research contained in this paper. It will be available on GitHub once the anonymous review process has finished. For now, you can download it as a zip file from the supplementary materials, unzip it and install it using devtools.
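If you are installing from the supplementary zip rather than from GitHub, something like the following should work; this is a hedged sketch only, and the zip and folder names are assumptions to be adjusted to match your download:

# A sketch of the supplementary-materials route (file and folder names assumed):
# unzip("collocateR.zip", exdir = "collocateR")
# devtools::install("collocateR/collocateR")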

Load libraries

First, the following script will load the required libraries, installing any that are not already installed on your system. The ‘devtools’ library is used for installing local libraries, ‘knitr’ and ‘kableExtra’ are used for publishing tables, and ‘cowplot’ is used to prepare figures.

# Install the following libraries if not already installed.
sapply(c("tidyverse", "tidytext", "quanteda", "devtools", "knitr", "kableExtra", "cowplot", "cleanNLP", "gridExtra"),
       function(x)
             if(!(x %in% rownames(installed.packages()))){install.packages(x)})
lapply(c("tidyverse", "tidytext", "quanteda", "devtools", "knitr", "kableExtra", "cowplot", "cleanNLP", "gridExtra"),
       require, 
       character.only = TRUE)

if(!("collocateR" %in% rownames(installed.packages()))){ # To save time, check if installed (downloading to install locally)
    url <- "https://github.com/cokelly/collocateR/archive/master.zip"
    download.file(url = url, destfile = "collocater.zip")
    unzip("collocater.zip", exdir = "collocateR")
    devtools::install("collocateR/collocateR-master")
}
library(collocateR)

Load the Data

In this section I load two versions of the data, one complete and one with ‘stopwords’ and ‘lemmas’ removed. In order to load the data from scratch, please refer to the ‘GetDocs.R’ file. Otherwise both corpora are contained here in ‘RDS’ files, which act as containers for R data. I then draw up some summary statistics. The remainder of this notebook draws in R scripts as functions in order to reproduce all the tables and figures in the paper.

docs <- readRDS("../data/docs.RDS")
docs_nostops <- readRDS("../data/docs_nostops.RDS") # This loads a version of the reports with 'stopwords' and lemmas removed.

Summary statistics

File names

The filename for each report contains relevant details about the report. The file ‘metadata.RDS’ contains the same data in tabular form. The first six objects/rows of each are printed below:
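Since each filename encodes the sector, country, company and year, the metadata can in principle be recovered directly from the names. A hedged sketch, assuming every filename follows the ‘Sector_Country_Company_Year.pdf’ pattern shown below:

# Sketch only: split the filenames into metadata fields (assumes the
# "Sector_Country_Company_Year.pdf" pattern holds for every report)
tibble(filename = names(docs)) %>%
      mutate(stem = str_remove(filename, "\\.pdf$")) %>%
      separate(stem, into = c("sector", "country", "company", "year"), sep = "_")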

head(names(docs))
## [1] "Energy_BRA_Petrobas_2007.pdf" "Energy_BRA_Petrobas_2008.pdf"
## [3] "Energy_BRA_Petrobas_2009.pdf" "Energy_BRA_Petrobas_2010.pdf"
## [5] "Energy_BRA_Petrobas_2011.pdf" "Energy_BRA_Petrobas_2012.pdf"
meta <- readRDS("../data/metadata.RDS")
head(meta)
## # A tibble: 6 x 6
##    year country company  sector filename                   `corpus positio~
##   <dbl> <chr>   <chr>    <chr>  <chr>                                 <int>
## 1  2007 BRA     Petrobas Energy Energy_BRA_Petrobas_2007.~                1
## 2  2008 BRA     Petrobas Energy Energy_BRA_Petrobas_2008.~                2
## 3  2009 BRA     Petrobas Energy Energy_BRA_Petrobas_2009.~                3
## 4  2010 BRA     Petrobas Energy Energy_BRA_Petrobas_2010.~                4
## 5  2011 BRA     Petrobas Energy Energy_BRA_Petrobas_2011.~                5
## 6  2012 BRA     Petrobas Energy Energy_BRA_Petrobas_2012.~                6
# To export a table, run
# kable(head(meta), format = "html")
# calculate the word length
wordcount <- docs %>%  
      as_tibble %>%
      unnest_tokens(., word, value) %>% 
      nrow 

wordcount_nostops <- docs_nostops %>%  
      as_tibble %>%
      unnest_tokens(., word, value) %>% 
      nrow 

# Calculate the number of reports, the number of companies and the earliest and latest years.
numreps <- nrow(meta)
numcomps <- length(unique(meta$company))
numcountries <- length(unique(meta$country))
firstyear <- min(meta$year)
lastyear <- max(meta$year)

The corpus is 13,073,963 words long. With stopwords removed it is 8,602,056 words long.

The dataset consists of 347 corporate social responsibility reports published by 36 global mining and energy firms from 10 states between 1998 and 2017.
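These figures come straight from the objects computed in the previous chunk (in the Rmd source they are presumably printed as inline R expressions). A quick way to check them:

# Verify the quoted summary figures
format(c(wordcount, wordcount_nostops), big.mark = ",")
c(reports = numreps, companies = numcomps, countries = numcountries,
  `first year` = firstyear, `last year` = lastyear)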

Table 1

Table 1 is a ‘keyword in context’ table from BG Group’s 2015 report. I use the version that includes stopwords here and publish rows 6-16 (where more narrative passages begin). The table is produced using Quanteda’s ‘kwic’ function:

# Isolate BG Group's 2015 report

bg_2015_meta <- meta %>% filter(company == "BG" & year == 2015)

bg_2015 <- docs[bg_2015_meta$`corpus position`]

# Extract kwics
table1 <- kwic(x = bg_2015, pattern = "rights", window = 6) %>%
      as_tibble %>%
      select(pre, keyword, post) %>%
      slice(6:16)

# To generate a publication-ready table
#kable(table1, format = "html", caption = "KWICs for Rights, BG Group Sustainability Report 2015") %>%
#      kable_styling("striped", full_width = TRUE)

table1
## # A tibble: 11 x 3
##    pre                              keyword post                           
##    <chr>                            <chr>   <chr>                          
##  1 has a commitment to respect hum~ rights  in line with the un guiding    
##  2 guiding principles on business ~ rights  we have an integrated approach~
##  3 have an integrated approach to ~ rights  across our business to reflect~
##  4 priorities in 2015 managed pote~ rights  issues related maintained our ~
##  5 voluntary principles on securit~ rights  in myanmar we conducted a human
##  6 in myanmar we conducted a human  rights  bg published a public position~
##  7 our approach impact assessment ~ rights  activities and performance int~
##  8 performance integrated approach~ rights  across the business and manage~
##  9 the business and manage potenti~ rights  issues through our existing in~
## 10 security and social performance~ rights  taskforce made up of senior re~
## 11 to provide assurance that poten~ rights  impacts were identified and ad~

Table 2

In Table 2 I use the corpus with lemmas and stopwords removed. I isolate the BG Group reports published between 2010 and 2015, calculate word frequencies for the words that form trigrams with “human right”, calculate normalised pointwise mutual information (npmi) scores, combine the two tables and sort by npmi.
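For reference, normalised PMI scales pointwise mutual information by the negative log of the joint probability, so scores run between -1 and 1. A minimal sketch of the measure itself (this is not collocateR’s internal code):

# npmi(x, y) = log( p(x,y) / (p(x) * p(y)) ) / -log( p(x,y) )
npmi_sketch <- function(joint_count, x_count, y_count, n) {
      p_xy <- joint_count / n
      log(p_xy / ((x_count / n) * (y_count / n))) / -log(p_xy)
}
# e.g. npmi_sketch(joint_count = 12, x_count = 48, y_count = 120, n = 1e5)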

# Isolate BG Group reports 2010-2015
bg_2010_2015_meta <- meta %>% filter(company == "BG" & year >= 2010 & year <= 2015)

bg_2010_2015 <- docs_nostops[bg_2010_2015_meta$`corpus position`]

# Note below "human right": "rights" has been lemmatied. Note also 'window = 1' gives us trigrams (as in words directly neighbouring "human right"). In the final table I slice out row one because it is 'human right' itself.

bg_trigram_freqs <- collocateR::get_freqs(bg_2010_2015, "human right", window = 1)
bg_trigram_npmi <- collocateR::get_npmi(bg_2010_2015, "human right", window = 1)

table2 <- bg_trigram_freqs %>% 
      full_join(., bg_trigram_npmi, by = "ngram") %>% 
      arrange(desc(npmi)) %>% 
      slice(2:11) %>% 
      add_column(rank = 1:10) %>% 
      select(rank, trigram = ngram,
             `trigram recurrence` = `Collocate Frequency`,
             `word recurrence` = `Document Frequency`,
             npmi)

# To generate a publication-ready table
#kable(table2, format = "html", caption = "Top 10 Word frequencies for trigrams with human rights, BG sustainability reports 2010-2015, with lemmas and stopwords removed, sorted by npmi") %>% 
#      kable_styling("striped", full_width = TRUE)
table2
## # A tibble: 10 x 5
##     rank trigram      `trigram recurrence` `word recurrence`  npmi
##    <int> <chr>                       <int>             <int> <dbl>
##  1     1 longstanding                    2                 3 0.534
##  2     2 vpshr                          12                48 0.531
##  3     3 security                       30               282 0.487
##  4     4 respect                         7                36 0.476
##  5     5 institute                       8                65 0.435
##  6     6 hsse                            9                80 0.431
##  7     7 integrate                       5                45 0.404
##  8     8 business                       33               714 0.390
##  9     9 risk                           27               547 0.389
## 10    10 hr                              2                17 0.376

Table 3

Table 3 compares the collocates of “human rights” with those of three other sets of words, using the corpus with stopwords and lemmas removed. In order to do this I collapse longer word forms (“environmental”) into shorter ones (“environment”) and count them as one.

# combine longer words into short
docs2 <- docs_nostops %>% 
      str_replace_all(., "environmental", "environment") %>%
      str_replace_all(., "safety", "safe")
# Get npmis for all the key terms
hr <- get_npmi(docs2, "human right") %>% slice(2:11) %>% select(`human rights` = ngram)
env <- get_npmi(docs2, "environment") %>% slice (2:11) %>% select(`environment/environmental` = ngram)
health <- get_npmi(docs2, "health") %>% slice(2:11) %>% select(`health` = ngram)
saf <- get_npmi(docs2, "safe") %>% slice(2:11) %>% select(`safe/safety` = ngram)
# Combine into table 3
table3 <- tibble(rank = 1:10) %>% bind_cols(hr) %>% bind_cols(env) %>% bind_cols(health) %>% bind_cols(saf)

# To generate a publication-ready table
#kable(table3, format = "html", caption = "Table 3: Words neighbouring “human rights” compared to words neighbouring either “environmental” or “environment,” to words neighbouring either “safety” or “safe”, and to words neighbouring “health”, full corpus.") %>% 
#      kable_styling("striped", full_width = TRUE)
table3
## # A tibble: 10 x 5
##     rank `human rights` `environment/environmenta~ health     `safe/safety`
##    <int> <chr>          <chr>                      <chr>      <chr>        
##  1     1 security       protection                 safe       health       
##  2     2 universal      impact                     occupatio~ occupational 
##  3     3 respect        social                     resort     environment  
##  4     4 declaration    safe                       hygiene    intrinsic    
##  5     5 salient        health                     environme~ workplace    
##  6     6 voluntary      stewardship                care       culture      
##  7     7 principle      performance                wellbeing  healthy      
##  8     8 proclaim       friendly                   wellness   performance  
##  9     9 complicit      beyond                     zapolyarye hse          
## 10    10 vpshr          management                 workplace  road

Figures 1 & 2

Figures 1 and 2 are more or less ordinary bar charts. Note that I filter out a number of companies and group by period rather than year. The code below also prepares the data for Figure 3.

# Filter docs by year and remove late-comer companies
meta_fig123 <- meta %>% 
      filter(!company %in% c("EDF", "Freeport-McMoran", "Ecopetrol", "Glencore",
                             "Randgold", "Mosaic", "Newmont", "Occidental",
                             "Schlumberger", "China Coal Energy"))

docs_fig123 <- docs[meta_fig123$`corpus position`]

# Count UNGC, UNGPs, VPSHR mentions

mentions <- meta_fig123 %>%
      # get UNGC and UNGP mentions
      add_column(`UNGC mentions` = (str_count(docs_fig123, coll("global compact")) + str_count(docs_fig123, coll("united nations global compact"))), .before = 1) %>% 
      add_column(`UNGP mentions` = (str_count(docs_fig123, coll("united nations guiding principles")) + str_count(docs_fig123, coll("guiding principles on business"))), .before = 2) %>%
      add_column(`VPSHR mentions` = (str_count(docs_fig123, coll("vpshr")) + str_count(docs_fig123, coll("voluntary principles on security and human rights"))), .before = 2) %>%
      # Get wordcount
      add_column(`wordcount` = str_count(docs_fig123, "\\S+")) %>%
      # Years to periods
      mutate(period = case_when(year >= 1998 & year <= 2010 ~ "1998-2010", year >= 2011 ~ "2011-2015")) %>%
      #filter(country == "UK") %>%
      # Group by company and period
      group_by(company, period) %>%
      # Add up mentions and word count cols
      summarise_at(vars(`UNGC mentions`, `UNGP mentions`, `VPSHR mentions`, wordcount), sum) %>%
      # Normalised mentions
      mutate(`Normalised_UNCG_mentions` = `UNGC mentions`/wordcount) %>%
      mutate(`Normalised_UNGP_mentions` = `UNGP mentions`/wordcount) %>%
      mutate(`Normalised_VPHSR_mentions` = `VPSHR mentions`/wordcount) %>%
      ungroup %>%
      mutate_if(is.character, as.factor) 
# Count overall mentions
mentions_of_ungc <- sum(mentions$`UNGC mentions`)
mentions_of_ungps <- sum(mentions$`UNGP mentions`)
mentions_of_vpshr <- sum(mentions$`VPSHR mentions`)

# Plot by company and period

ungc_mentions.p <- ggplot(mentions, aes(company, Normalised_UNCG_mentions)) +
      geom_col() + 
      facet_grid(period ~ .) +
      background_grid(major = "xy", minor = "none") +
      #labs(title = "Figure 1: Mentions of the UN Global Compact", subtitle = "by company and period") +
      ylab("Mentions \n normalised by document lengths") +
      theme(axis.text.x = element_text(angle = 90, vjust = -0.01),
            axis.text.y = element_blank(),
            axis.title.y = element_text(size = 10),
            axis.title.x = element_blank())

ungp_mentions.p <- ggplot(mentions, aes(company, Normalised_UNGP_mentions)) +
      geom_col() +
      facet_grid(period ~ .) +
      background_grid(major = "xy", minor = "none") +
      #labs(title = "Figure 2: Mentions of the UN Guiding Principles", subtitle = "by company and period") +
      ylab("Mentions \n normalised by document lengths") +
      theme(axis.text.x = element_text(angle = 90, vjust = -0.01),
            axis.text.y = element_blank(),
            axis.title.y = element_text(size = 10),
            axis.title.x = element_blank())

ungc_mentions.p

ungp_mentions.p

Cowplot provides a simple function for saving plots.

ggsave("UNGC_mentions.png", ungc_mentions.p, dpi = 600, width = 14.82, height = 10.19, units = c("cm"))

ggsave("UNGPs_mentions.png", ungp_mentions.p, dpi = 600, width = 14.82, height = 10.19, units = c("cm"))

Table 4

Table 4 simply calculates NPMI for “global compact” within BG Group’s reports.

bg_locs <- meta %>% filter(company == "BG")

table4 <- unlist(docs_nostops[bg_locs$`corpus position`]) %>% 
      get_npmi(., "global compact", ngram = 1) %>%
      slice(2:21) %>% select(collocate = ngram) %>% add_column(rank = 1:20, .before = 1)

# To generate a publication-ready table
#kable(table4, format = "html", caption = "Table 4: Top 20 collocates with ‘Global Compact’, BG Group, ranked by npmi.") %>%
#      kable_styling("striped", full_width = TRUE)
table4
## # A tibble: 20 x 2
##     rank collocate  
##    <int> <chr>      
##  1     1 un         
##  2     2 ungc       
##  3     3 reaffirm   
##  4     4 ten        
##  5     5 iogp       
##  6     6 unite      
##  7     7 signatory  
##  8     8 sign       
##  9     9 nation     
## 10    10 gri        
## 11    11 declaration
## 12    12 universal  
## 13    13 gl         
## 14    14 fulfil     
## 15    15 toxic      
## 16    16 ipieca     
## 17    17 remain     
## 18    18 dnv        
## 19    19 principle  
## 20    20 commit

Figure 3

Figure 3 follows directly from the data prepared for Figures 1 and 2 above.

vpshr_mentions.p <- ggplot(mentions, aes(company, Normalised_VPHSR_mentions)) +
      geom_col() +
      facet_grid(period ~ .) +
      background_grid(major = "xy", minor = "none") +
      #labs(title = "Figure 3: Mentions of the Voluntary Principles on Security and Human Rights", subtitle = "by company and period") +
      ylab("Mentions \n normalised by document lengths") +
      theme(axis.text.x = element_text(angle = 90, vjust = -0.01),
            axis.text.y = element_blank(),
            axis.title.y = element_text(size = 10),
            axis.title.x = element_blank())
vpshr_mentions.p

Table 5

Table 5 is perhaps the most computationally complex table in the paper. I draw on spaCy, a Python package, through the cleanNLP R library (Arnold, 2017; Honnibal and Johnson, 2015) in order to isolate verbs within the rankings of collocates for the UN Global Compact and the Voluntary Principles on Security and Human Rights. From there the table is straightforward, but the process is time-consuming. On a Windows PC, the Visual Studio C++ build tools are also required.

# Merge "ungc" and "global compact" and "united nations global compact" and "un global compact"

docs_verbs <- docs_nostops %>%
      str_replace_all(., "united nations global compact", "ungc") %>%
      str_replace_all(., "un global compact", "ungc") %>%
      str_replace_all(., "global compact", "ungc") %>%
      str_replace_all(., "unglobalcompact", "ungc") %>%
      str_replace_all(., "voluntary principles on security and human rights", "vpshr")

ungc_npmi <- get_npmi(docs_verbs, "ungc")
vpshr_npmi <- get_npmi(docs_verbs, "vpshr")

ungc_words <- ungc_npmi %>% # create a character vector of ngrams to categorise through CleanNLP 
      select(ngram) %>% 
      unlist %>% 
      as.character %>%
      paste(., sep = " ", collapse = " ")

vpshr_words <- vpshr_npmi %>%
      select(ngram) %>%
      unlist %>%
      as.character %>%
      paste(., sep = " ", collapse = " ")
# First, initialise spaCy in order to use it with cleanNLP (you may need to install spaCy using "pip install spacy", on Windows from an administrator console; more information at https://spacy.io/).
cnlp_init_spacy()
## Loading required namespace: reticulate
# Generate tokens
writeLines(ungc_words, tfungc <- tempfile())
anno_ungc <- cnlp_annotate(tfungc)
# Generate table by creating tibble of categorised tokens, isolating verbs and left joining with npmi scores
table5_ungc_raw <- cnlp_get_token(anno_ungc) %>% # Tokenise
      filter(upos == "VERB") %>% # Isolate "universal parts of speech" classification as verbs
      select(ngram = word) %>% # Select the ngrams. The cleanNLP package also delivers lemmas but the corpus had already been lemmatised  
      left_join(., ungc_npmi, by = "ngram") # (Re)join with npmi score
# And again for the VPSHR
writeLines(vpshr_words, tfvpshr <- tempfile())
anno_vpshr <- cnlp_annotate(tfvpshr)
# Generate table by creating tibble of categorised tokens, isolating verbs and left joining with npmi scores
table5_vpshr_raw <- cnlp_get_token(anno_vpshr) %>% # Tokenise
      filter(upos == "VERB") %>% # Isolate "universal parts of speech" classification as verbs
      select(ngram = word) %>% # Select the ngrams. The cleanNLP package also delivers lemmas but the corpus had already been lemmatised  
      left_join(., vpshr_npmi, by = "ngram") # (Re)join with npmi score

Note that the resulting table is not very tidy. The NLP algorithm returns a lot of acronyms as verbs (and also words like ‘danish’). I have to select complete or near-complete words manually to complete Table 5.

# Verbs selected manually for the UNGC column
ungc_verbs <- c("disappear", "join", "disclo", "found", "assure", "participate",
                "submit", "rotate", "reflected", "indica", "disclose", "invite",
                "illustrate", "lead", "appear", "observe", "articulate", "promote",
                "harmonize", "initiate", "link", "stockpile", "serve", "find",
                "belong", "continue", "become", "describe", "steer", "set", "hope",
                "remain", "include", "draw", "seek", "declare", "do", "communicate",
                "ask", "accept", "verify", "stimulate", "keep", "reflect", "call",
                "govern", "expand", "strive", "embed", "implement", "adopt",
                "believe", "break", "look", "stand", "intend", "prepare", "indicate",
                "realize", "involve", "obtain", "rely", "relate", "live", "represent",
                "make", "gather", "provide", "emerge", "establish", "take", "pursue",
                "expect", "achieve", "adapt", "apply", "read", "see", "strengthen",
                "put", "identify", "protect", "learn", "surround", "vary", "ensure",
                "develop", "build", "deliver", "carry", "finance", "enter", "add",
                "affect", "certify", "invest", "enable", "reach", "assess", "centre",
                "begin", "undertake", "enhance", "need", "distribute", "generate",
                "agree", "go", "hold", "may", "project", "maintain", "collect",
                "improve", "create", "consider", "operate", "come", "manage", "give",
                "be", "know", "mean", "grow", "allow", "exclude", "start", "perform",
                "cost", "have", "receive", "must", "offer", "reduce")

table5_left <- table5_ungc_raw %>% 
      filter(ngram %in% ungc_verbs)

# Verbs selected manually for the VPSHR column
vpshr_verbs <- c("implement", "read", "promote", "join", "enter", "seek", "submit",
                 "agree", "expect", "protect", "undertake", "avoid", "identify",
                 "carry", "set", "become", "contain", "reflect", "introduce",
                 "develop", "come", "engage", "provide", "include", "prepare",
                 "establish", "find", "begin", "ensure", "support", "apply",
                 "maintain", "integrate", "require", "understand", "see", "use",
                 "can", "aim", "create", "continue", "operate", "follow", "manage",
                 "improve", "increase")

table5_right <- table5_vpshr_raw %>%
      filter(ngram %in% vpshr_verbs)

This forms the basis of Table 5, after some manual editing to complete truncated words (“disclo” to “disclose”, for example).
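The truncated forms could also be completed programmatically; a hedged sketch using dplyr::recode, where the object name and the two recodings are illustrative only:

# Illustrative only: complete two truncated forms before publication
table5_left_edited <- table5_left %>%
      mutate(ngram = recode(ngram, "disclo" = "disclose", "indica" = "indicate"))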

To view the words that were removed, try:

table5_left_removed <- table5_ungc_raw %>% 
      filter(!ngram %in% ungc_verbs)

table5_right_removed <- table5_vpshr_raw %>%
      filter(!ngram %in% vpshr_verbs)

To complete Table 5:

table5 <- table5_left[1:10, 1] %>% 
      bind_cols(table5_right[1:10, 1]) %>% 
      rename(UNGC = ngram) %>% 
      rename(VSHR = ngram1) %>% 
      add_column(rank = 1:10, .before = 1) %>% 
      add_column(rank1 = 1:10, .before = 3)
table5
## # A tibble: 10 x 4
##     rank UNGC        rank1 VSHR     
##    <int> <chr>       <int> <chr>    
##  1     1 disappear       1 implement
##  2     2 join            2 read     
##  3     3 disclo          3 promote  
##  4     4 found           4 join     
##  5     5 assure          5 enter    
##  6     6 participate     6 seek     
##  7     7 submit          7 submit   
##  8     8 rotate          8 agree    
##  9     9 reflected       9 expect   
## 10    10 indica         10 protect

Figure 4

Figure 4 is a ‘lexical dispersion plot.’ Such plots are used to illustrate how keywords are distributed across bodies of literature. Each row represents a single document in the corpus, normalised for length. In this case I aim to show how the patterns of ‘risk’ and ‘human rights’ mentions differ across BG Group’s reports.

meta_BG <- meta %>% filter(company == "BG")

docs_BG <- docs[meta_BG$`corpus position`]

names(docs_BG) <- paste("BG", meta_BG$year)

BG_hr_plot <- textplot_xray(kwic(unlist(docs_BG), phrase("human rights"))) + # xray plot (lexical dispersion) from Quanteda
      theme(axis.title.y=element_blank(),
            axis.text.y = element_blank(),
            axis.text.x = element_blank(),
            axis.ticks.x = element_blank(),
            axis.title = element_blank(),
            plot.title = element_blank(),
            strip.text.y = element_blank()) +
      background_grid(major = "xy", size.major = 1)
BG_risk_plot <- textplot_xray(kwic(unlist(docs_BG), "risk")) +
      theme(axis.title.y=element_blank(),
            axis.text.y = element_blank(),
            axis.text.x = element_blank(),
            axis.ticks.x = element_blank(),
            axis.title = element_blank(),
            plot.title = element_blank()) +
      background_grid(major = "xy", size.major = 1)
#  bg_docnames <- gridExtra::tableGrob(names(bg_words))

lexical_disp.p <- gridExtra::grid.arrange(BG_hr_plot, BG_risk_plot, ncol = 2, widths = 2:3)

lexical_disp.p
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
# To add caption use:

#lexical_disp <- gridExtra::grid.arrange(BG_hr_plot, BG_risk_plot, ncol = 2, widths = 2:3, bottom = grid::textGrob("Human Rights and Risk compared\nBG reports 2001-2015", gp = grid::gpar(fontsize = 12, fontface = "bold")))

We can save this plot with:

ggsave(filename = "Human_Rights_Risk_Lexical_Dispersion_Plot.png", plot = lexical_disp.p, dpi = 600, width = 14.82, height = 10.19, units = c("cm"))

Software

The following library versions were used to compile this document on the 18th of December 2018:

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2   collocateR_0.8   gridExtra_2.3    cleanNLP_2.3.0  
##  [5] cowplot_0.9.3    kableExtra_0.9.0 knitr_1.21       usethis_1.4.0   
##  [9] devtools_2.0.1   quanteda_1.3.14  tidytext_0.2.0   forcats_0.3.0   
## [13] stringr_1.3.1    dplyr_0.7.8      purrr_0.2.5      readr_1.3.0     
## [17] tidyr_0.8.2      tibble_1.4.2     ggplot2_3.1.0    tidyverse_1.2.1 
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137       fs_1.2.6           lubridate_1.7.4   
##  [4] httr_1.4.0         rprojroot_1.3-2    SnowballC_0.5.1   
##  [7] tools_3.5.1        backports_1.1.3    utf8_1.1.4        
## [10] R6_2.3.0           lazyeval_0.2.1     colorspace_1.3-2  
## [13] withr_2.1.2        tidyselect_0.2.5   prettyunits_1.0.2 
## [16] processx_3.2.1     compiler_3.5.1     cli_1.0.1         
## [19] rvest_0.3.2        xml2_1.2.0         desc_1.2.0        
## [22] labeling_0.3       scales_1.0.0       callr_3.1.0       
## [25] digest_0.6.18      rmarkdown_1.11     pkgconfig_2.0.2   
## [28] htmltools_0.3.6    sessioninfo_1.1.1  rlang_0.3.0.1     
## [31] readxl_1.1.0       rstudioapi_0.8     bindr_0.1.1       
## [34] generics_0.0.2     jsonlite_1.6       tokenizers_0.2.1  
## [37] magrittr_1.5       Matrix_1.2-15      Rcpp_1.0.0        
## [40] munsell_0.5.0      fansi_0.4.0        reticulate_1.10   
## [43] stringi_1.2.4      yaml_2.2.0         pkgbuild_1.0.2    
## [46] plyr_1.8.4         grid_3.5.1         crayon_1.3.4      
## [49] lattice_0.20-38    haven_2.0.0        hms_0.4.2         
## [52] ps_1.2.1           pillar_1.3.1       reshape2_1.4.3    
## [55] stopwords_0.9.0    pkgload_1.0.2      fastmatch_1.1-0   
## [58] glue_1.3.0         evaluate_0.12      data.table_1.11.8 
## [61] remotes_2.0.2      RcppParallel_4.4.2 modelr_0.1.2      
## [64] cellranger_1.1.0   gtable_0.2.0       assertthat_0.2.0  
## [67] xfun_0.4           broom_0.5.1        janeaustenr_0.1.5 
## [70] viridisLite_0.3.0  memoise_1.1.0      spacyr_1.0

  1. Wickham H, Tidyverse: Easily Install and Load ‘Tidyverse’ Packages (2017), https://www.tidyverse.org.

  2. Benoit K, Quanteda: Quantitative Analysis of Textual Data (2018), https://quanteda.io; Silge J and Robinson D, Tidytext: Text Mining and Analysis Using Tidy Data Principles in R (2018), https://cran.r-project.org/web/packages/tidytext/index.html.