How to avoid copying and pasting code by using functions, for loops and functionals

One of the most important principles in writing software is “don’t repeat yourself (DRY)”. This module covers using functions, for-loops, and functionals to avoid copying and pasting code.

In addition, see the R for Data Science chapters Functions and Iteration (for-loops and functions) for more detailed discussions of these concepts.

Setup

Consider the example from QSS Exercise 1. Review the explanations and instructions for that question.

# a QSS Exercise from Chapter 1.
library("tidyverse")

## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.1     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0

## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library("qss")

data("Kenya", package = "qss")
data("Sweden", package = "qss")
data("World", package = "qss")

Replacing Copy and Pasting Code with Functions

One question asks you to calculate the crude death rate (CDR).
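As a rough sketch of the arithmetic, assume the CDR is total deaths divided by total person-years (men plus women), which is how the code in this section computes it. The data frame toy and its numbers are made up for illustration and do not come from the qss data.

```r
# Made-up numbers for illustration only; they are not from the qss data.
toy <- data.frame(
  period   = c("P1", "P1", "P2", "P2"),
  deaths   = c(2, 3, 1, 1),
  py.men   = c(100, 150, 100, 100),
  py.women = c(100, 150, 100, 100)
)

# CDR for one period: total deaths divided by total person-years.
p1 <- toy[toy$period == "P1", ]
cdr_p1 <- sum(p1$deaths) / sum(p1$py.men + p1$py.women)
cdr_p1
```

Here period P1 has 5 deaths over 500 person-years, so the toy rate is 0.01.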

# Let's write some code that works for Kenya
Kenya %>%
  group_by(period) %>%
  summarize(CDR_data = sum(deaths) / sum(py.men + py.women))

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955   0.0240
## 2 2005-2010   0.0104

Awesome! This code does what I want.

But, it only works for Kenya, and I’d need to copy and paste it to work on the Sweden and World data frames.

Sweden %>%
  group_by(period) %>%
  summarize(CDR_data = sum(deaths) / sum(py.men + py.women))

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.00984
## 2 2005-2010  0.00997

World %>%
  group_by(period) %>%
  summarize(CDR_data = sum(deaths) / sum(py.men + py.women))

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.0193 
## 2 2005-2010  0.00817

R4DS notes that I should write a function anytime I’m doing the same thing 3 or more times.

To generalize the previous code, consider what changes each time it is copied and pasted. The only thing that changes is the input data frame: Kenya, Sweden, or World. I could rewrite this code by creating a new variable x and assigning to it the data frame I want to use.

x <- Kenya
# x <- Sweden
# x <- World

x %>%
  group_by(period) %>%
  summarize(CDR_data = sum(deaths) / sum(py.men + py.women))

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955   0.0240
## 2 2005-2010   0.0104

This helps clarify the programming task, but isn’t yet that useful because we would still have to copy and paste that code to run it in a script.

However, we have now written the code as a variable x that holds the input and an expression that uses x. We can take that code and encapsulate it in a function.

CDR <- function(x) {
  x %>%
    group_by(period) %>%
    summarize(CDR_data = sum(deaths) / sum(py.men + py.women))
}

A function consists of three parts: name, arguments (inputs), and body.

The name is the variable the function is assigned to, in this case CDR. Functions are objects, so they can be assigned to variable names like any other value.
The arguments are the names given to the input values, in this case x.
The body is the code that the function executes.
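The three parts can be inspected directly in base R. The sketch below uses a hypothetical toy function, rate, which is not part of the original exercise. The functions formals() and body() are base R helpers for looking at a function's arguments and body.

```r
# `rate` is a hypothetical toy function, not part of the original exercise.
rate <- function(events, exposure) {
  events / exposure
}

# Arguments: the names given to the inputs.
arg_names <- names(formals(rate))
arg_names

# Body: the code the function runs when it is called.
body(rate)

# Functions are objects, so the same function can be bound to another name.
rate2 <- rate
rate2(5, 100)
```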

Now we can run CDR on Kenya, Sweden, and World:

CDR(Kenya)

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955   0.0240
## 2 2005-2010   0.0104

CDR(Sweden)

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.00984
## 2 2005-2010  0.00997

CDR(World)

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.0193 
## 2 2005-2010  0.00817

Repeating Inputs with For Loops

In the previous section we still had to run the function CDR three times, once for each country. While three countries is not terrible, this would be quite tedious if we had hundreds of countries.

In this section, we’ll reduce that redundancy using a for loop.

Lists

The first step is to put those data frames in a list.

countries <- list(KEN = Kenya, SWE = Sweden, WLD = World)

The names of the list elements are arbitrary. They were deliberately chosen to differ from the names of the data frames to emphasize this.

The list countries has 3 elements,

length(countries)

## [1] 3

with names

names(countries)

## [1] "KEN" "SWE" "WLD"

To get an element from the list, use [[ or $ and reference the elements by name or index number:

summary(countries$KEN)

##    country             period              age                births      
##  Length:30          Length:30          Length:30          Min.   :   0.0  
##  Class :character   Class :character   Class :character   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :   0.0  
##                                                           Mean   : 298.3  
##                                                           3rd Qu.: 267.5  
##                                                           Max.   :2285.7  
##      deaths           py.men            py.women             l_x        
##  Min.   : 18.51   Min.   :   39.85   Min.   :   82.24   Min.   :  9457  
##  1st Qu.: 21.24   1st Qu.:  840.92   1st Qu.:  733.89   1st Qu.: 53537  
##  Median : 55.13   Median : 1712.56   Median : 1800.20   Median : 66138  
##  Mean   : 91.06   Mean   : 3681.89   Mean   : 3684.72   Mean   : 64688  
##  3rd Qu.:103.75   3rd Qu.: 4506.66   3rd Qu.: 4417.88   3rd Qu.: 81411  
##  Max.   :661.25   Max.   :15932.59   Max.   :15674.83   Max.   :100000

summary(countries[["SWE"]])

##    country             period              age                births      
##  Length:30          Length:30          Length:30          Min.   :  0.00  
##  Class :character   Class :character   Class :character   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :  0.00  
##                                                           Mean   : 36.48  
##                                                           3rd Qu.: 58.80  
##                                                           Max.   :193.19  
##      deaths            py.men          py.women           l_x        
##  Min.   :  0.191   Min.   : 250.7   Min.   : 320.6   Min.   : 35709  
##  1st Qu.:  1.522   1st Qu.:1217.2   1st Qu.:1230.0   1st Qu.: 93116  
##  Median :  4.157   Median :1383.4   Median :1358.1   Median : 96880  
##  Mean   : 26.908   Mean   :1350.0   Mean   :1364.1   Mean   : 91839  
##  3rd Qu.: 14.406   3rd Qu.:1508.7   3rd Qu.:1499.5   3rd Qu.: 98904  
##  Max.   :271.644   Max.   :2612.9   Max.   :2635.8   Max.   :100000

summary(countries[[3]])

##    country             period              age                births      
##  Length:30          Length:30          Length:30          Min.   :     0  
##  Class :character   Class :character   Class :character   1st Qu.:     0  
##  Mode  :character   Mode  :character   Mode  :character   Median :     0  
##                                                           Mean   : 38782  
##                                                           3rd Qu.: 55748  
##                                                           Max.   :219277  
##      deaths           py.men           py.women      
##  Min.   :  3248   Min.   :  30527   Min.   :  47262  
##  1st Qu.:  6263   1st Qu.: 380398   1st Qu.: 388215  
##  Median :  8104   Median : 673948   Median : 665470  
##  Mean   : 17517   Mean   : 778510   Mean   : 770349  
##  3rd Qu.: 14409   3rd Qu.:1184047   3rd Qu.:1153630  
##  Max.   :101090   Max.   :1619802   Max.   :1512274
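The three access forms used above ($ with a name, [[ with a name, and [[ with an index) all return the same element when they point at the same position. A minimal sketch with a made-up list, toy_list, whose names and contents are purely illustrative:

```r
# Toy list with made-up contents; the names are arbitrary labels.
toy_list <- list(a = 1:3, b = letters[1:2])

by_dollar <- toy_list$a        # access by name with $
by_name   <- toy_list[["a"]]   # access by name with [[
by_index  <- toy_list[[1]]     # access by position with [[

identical(by_dollar, by_name)
identical(by_name, by_index)
```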

The for loop lets us run the same code once for each element of the list:

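for (x in countries) {
  # glimpse() prints a compact summary and returns its input invisibly,
  # so wrapping it in print() also prints the full data frame
  print(glimpse(x))
  print(CDR(x))
}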

## Observations: 30
## Variables: 8
## $ country  <chr> "KEN", "KEN", "KEN", "KEN", "KEN", "KEN", "KEN", "KEN...
## $ period   <chr> "1950-1955", "1950-1955", "1950-1955", "1950-1955", "...
## $ age      <chr> "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30...
## $ births   <dbl> 0.000, 0.000, 0.000, 264.185, 485.564, 383.252, 268.6...
## $ deaths   <dbl> 398.314, 36.799, 19.520, 18.508, 21.361, 20.424, 18.9...
## $ py.men   <dbl> 2982.589, 1977.935, 1634.561, 1588.554, 1427.824, 120...
## $ py.women <dbl> 2977.828, 1969.698, 1633.975, 1564.652, 1364.061, 110...
## $ l_x      <dbl> 100000.000, 75157.419, 71364.502, 69469.475, 67421.14...
##    country    period   age   births  deaths    py.men  py.women        l_x
## 1      KEN 1950-1955   0-4    0.000 398.314  2982.589  2977.828 100000.000
## 2      KEN 1950-1955   5-9    0.000  36.799  1977.935  1969.698  75157.419
## 3      KEN 1950-1955 10-14    0.000  19.520  1634.561  1633.975  71364.502
## 4      KEN 1950-1955 15-19  264.185  18.508  1588.554  1564.652  69469.475
## 5      KEN 1950-1955 20-24  485.564  21.361  1427.824  1364.061  67421.149
## 6      KEN 1950-1955 25-29  383.252  20.424  1204.917  1105.817  64854.800
## 7      KEN 1950-1955 30-34  268.644  18.979  1033.053   928.075  62126.807
## 8      KEN 1950-1955 35-39  165.693  18.854   913.425   802.620  59238.699
## 9      KEN 1950-1955 40-44   79.582  19.301   816.753   710.981  56114.347
## 10     KEN 1950-1955 45-49   25.573  19.889   692.612   654.844  52677.510
## 11     KEN 1950-1955 50-54    0.000  21.198   568.514   592.359  48980.402
## 12     KEN 1950-1955 55-59    0.000  23.384   455.258   501.808  44720.493
## 13     KEN 1950-1955 60-69    0.000  55.056   600.284   710.673  39640.720
## 14     KEN 1950-1955 70-79    0.000  53.304   231.866   337.111  25809.675
## 15     KEN 1950-1955   80+    0.000  24.420    39.847    82.243   9457.307
## 16     KEN 2005-2010   0-4    0.000 661.251 15932.589 15674.827 100000.000
## 17     KEN 2005-2010   5-9    0.000  76.725 13254.048 13100.146  90959.093
## 18     KEN 2005-2010 10-14    0.000  66.137 11380.859 11277.374  89360.375
## 19     KEN 2005-2010 15-19 1063.629  62.441 10640.972 10575.915  88101.925
## 20     KEN 2005-2010 20-24 2285.725  75.375  9707.670  9692.037  86830.022
## 21     KEN 2005-2010 25-29 1868.313 105.366  8046.140  8020.328  85158.465
## 22     KEN 2005-2010 30-34 1119.287 132.677  6324.058  6188.021  82430.691
## 23     KEN 2005-2010 35-39  611.440 131.216  4794.929  4657.950  78352.835
## 24     KEN 2005-2010 40-44  208.039  98.897  3641.841  3697.673  72998.677
## 25     KEN 2005-2010 45-49  118.814  67.807  2892.613  3114.355  68493.362
## 26     KEN 2005-2010 50-54    0.000  55.204  2353.851  2596.142  64781.860
## 27     KEN 2005-2010 55-59    0.000  52.216  1790.568  1966.429  61285.591
## 28     KEN 2005-2010 60-69    0.000 111.775  2076.313  2325.052  57288.655
## 29     KEN 2005-2010 70-79    0.000 149.509  1122.979  1317.524  44378.060
## 30     KEN 2005-2010   80+    0.000 115.845   329.133   401.195  23198.269
## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955   0.0240
## 2 2005-2010   0.0104
## Observations: 30
## Variables: 8
## $ country  <chr> "SWE", "SWE", "SWE", "SWE", "SWE", "SWE", "SWE", "SWE...
## $ period   <chr> "1950-1955", "1950-1955", "1950-1955", "1950-1955", "...
## $ age      <chr> "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30...
## $ births   <dbl> 0.000, 0.000, 0.000, 40.823, 141.137, 160.882, 116.90...
## $ deaths   <dbl> 13.765, 1.302, 1.216, 1.581, 2.264, 2.885, 3.610, 4.7...
## $ py.men   <dbl> 1490.037, 1542.698, 1266.321, 1078.133, 1119.421, 130...
## $ py.women <dbl> 1410.502, 1470.816, 1217.133, 1049.193, 1105.129, 128...
## $ l_x      <dbl> 100000.00, 97575.29, 97308.11, 97093.74, 96733.76, 96...
##    country    period   age  births  deaths   py.men py.women       l_x
## 1      SWE 1950-1955   0-4   0.000  13.765 1490.037 1410.502 100000.00
## 2      SWE 1950-1955   5-9   0.000   1.302 1542.698 1470.816  97575.29
## 3      SWE 1950-1955 10-14   0.000   1.216 1266.321 1217.133  97308.11
## 4      SWE 1950-1955 15-19  40.823   1.581 1078.133 1049.193  97093.74
## 5      SWE 1950-1955 20-24 141.137   2.264 1119.421 1105.129  96733.76
## 6      SWE 1950-1955 25-29 160.882   2.885 1305.003 1284.552  96223.56
## 7      SWE 1950-1955 30-34 116.906   3.610 1367.220 1338.146  95692.18
## 8      SWE 1950-1955 35-39  64.795   4.704 1365.747 1333.127  95041.54
## 9      SWE 1950-1955 40-44  21.824   6.809 1366.917 1346.314  94214.33
## 10     SWE 1950-1955 45-49   1.702  10.015 1256.239 1268.418  93037.78
## 11     SWE 1950-1955 50-54   0.000  14.225 1101.377 1139.260  91196.10
## 12     SWE 1950-1955 55-59   0.000  19.862  944.579 1008.943  88360.78
## 13     SWE 1950-1955 60-69   0.000  65.883 1456.084 1620.309  83988.01
## 14     SWE 1950-1955 70-79   0.000 106.465  829.383  945.557  67496.43
## 15     SWE 1950-1955   80+   0.000  95.869  250.678  320.593  35708.75
## 16     SWE 2005-2010   0-4   0.000   1.801 1361.938 1290.214 100000.00
## 17     SWE 2005-2010   5-9   0.000   0.191 1204.157 1142.830  99675.23
## 18     SWE 2005-2010 10-14   0.000   0.320 1445.904 1372.248  99631.15
## 19     SWE 2005-2010 15-19   9.000   0.832 1588.189 1507.308  99584.84
## 20     SWE 2005-2010 20-24  69.501   1.318 1435.878 1369.963  99440.57
## 21     SWE 2005-2010 25-29 155.772   1.354 1399.640 1340.452  99205.05
## 22     SWE 2005-2010 30-34 193.185   1.503 1511.593 1460.486  98966.03
## 23     SWE 2005-2010 35-39  98.987   2.155 1639.975 1581.454  98717.43
## 24     SWE 2005-2010 40-44  19.122   3.336 1637.465 1572.523  98386.34
## 25     SWE 2005-2010 45-49   0.907   5.310 1524.391 1476.251  97871.49
## 26     SWE 2005-2010 50-54   0.000   8.590 1450.930 1423.215  97027.02
## 27     SWE 2005-2010 55-59   0.000  14.466 1541.470 1529.924  95583.16
## 28     SWE 2005-2010 60-69   0.000  51.588 2612.883 2635.789  93349.47
## 29     SWE 2005-2010 70-79   0.000  92.385 1500.119 1794.682  84371.90
## 30     SWE 2005-2010   80+   0.000 271.644  904.764 1567.216  63683.52
## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.00984
## 2 2005-2010  0.00997
## Observations: 30
## Variables: 7
## $ country  <chr> "WORLD", "WORLD", "WORLD", "WORLD", "WORLD", "WORLD",...
## $ period   <chr> "1950-1955", "1950-1955", "1950-1955", "1950-1955", "...
## $ age      <chr> "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30...
## $ births   <dbl> 0.000, 0.000, 0.000, 54238.336, 131939.226, 128063.23...
## $ deaths   <dbl> 101090.393, 7960.861, 5550.176, 5831.195, 6649.425, 6...
## $ py.men   <dbl> 946910.59, 726903.29, 666794.72, 626191.13, 573518.17...
## $ py.women <dbl> 904909.28, 694574.52, 635492.29, 600677.87, 555221.02...
##    country    period   age     births     deaths     py.men   py.women
## 1    WORLD 1950-1955   0-4      0.000 101090.393  946910.59  904909.28
## 2    WORLD 1950-1955   5-9      0.000   7960.861  726903.29  694574.52
## 3    WORLD 1950-1955 10-14      0.000   5550.176  666794.72  635492.29
## 4    WORLD 1950-1955 15-19  54238.336   5831.195  626191.13  600677.87
## 5    WORLD 1950-1955 20-24 131939.226   6649.425  573518.17  555221.02
## 6    WORLD 1950-1955 25-29 128063.230   6425.219  508500.43  507276.96
## 7    WORLD 1950-1955 30-34  89246.610   6208.629  433338.38  437131.76
## 8    WORLD 1950-1955 35-39  56005.545   6877.151  400279.43  405527.72
## 9    WORLD 1950-1955 40-44  24326.598   7995.126  373771.34  382443.66
## 10   WORLD 1950-1955 45-49   5071.982   8885.744  326278.28  333888.55
## 11   WORLD 1950-1955 50-54      0.000   9690.833  273318.07  285689.90
## 12   WORLD 1950-1955 55-59      0.000  11109.256  219381.03  238443.39
## 13   WORLD 1950-1955 60-69      0.000  27701.495  302214.30  353255.95
## 14   WORLD 1950-1955 70-79      0.000  26702.179  133347.24  173890.98
## 15   WORLD 1950-1955   80+      0.000  14341.537   30527.16   47261.68
## 16   WORLD 2005-2010   0-4      0.000  40098.376 1619801.88 1512273.96
## 17   WORLD 2005-2010   5-9      0.000   3757.199 1544406.24 1444845.83
## 18   WORLD 2005-2010 10-14      0.000   3247.765 1552687.63 1457101.18
## 19   WORLD 2005-2010 15-19  73194.689   4043.199 1593937.86 1509488.81
## 20   WORLD 2005-2010 20-24 219277.185   5399.345 1503387.77 1442885.43
## 21   WORLD 2005-2010 25-29 190318.699   5996.753 1337034.07 1294852.69
## 22   WORLD 2005-2010 30-34 114931.466   6519.377 1259434.33 1225101.75
## 23   WORLD 2005-2010 35-39  54975.769   7238.027 1210083.54 1177472.56
## 24   WORLD 2005-2010 40-44  17604.730   8212.596 1105937.47 1082103.13
## 25   WORLD 2005-2010 45-49   4278.729   9709.548  960561.39  948668.83
## 26   WORLD 2005-2010 50-54      0.000  11955.139  840307.73  837232.56
## 27   WORLD 2005-2010 55-59      0.000  14431.099  681101.65  696280.73
## 28   WORLD 2005-2010 60-69      0.000  38270.238  908660.24  982545.45
## 29   WORLD 2005-2010 70-79      0.000  54685.244  515933.20  636365.65
## 30   WORLD 2005-2010   80+      0.000  58928.480  180743.64  307562.46
## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.0193 
## 2 2005-2010  0.00817

The for loop runs three times:

Set x = countries[[1]], and run CDR(x)
Set x = countries[[2]], and run CDR(x)
Set x = countries[[3]], and run CDR(x)
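The steps above can be sketched on toy data by writing the loop once and then unrolling it by hand. The list inputs and the function total are hypothetical stand-ins for countries and CDR, with made-up values.

```r
# Toy inputs and a hypothetical stand-in for CDR(); values are made up.
inputs <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)
total <- function(x) sum(x)

# The loop sets x to each element in turn and runs the body once per element.
n_runs <- 0
for (x in inputs) {
  n_runs <- n_runs + 1
  print(total(x))
}

# The same work written out by hand, one step per element.
x <- inputs[[1]]
print(total(x))
x <- inputs[[2]]
print(total(x))
x <- inputs[[3]]
print(total(x))
```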

However, while we’ve run CDR on each country we haven’t saved the results to use anywhere. We’ll make three changes to our code:

Define an empty list cdr_results of the same length as countries to store the results
Loop over the names of the countries instead of the values so we can name the elements of the results.
Within the loop save the result to an element in cdr_results

First, create an empty list the same length as countries.

cdr_results <- vector("list", length = length(countries))

cdr_results

## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL

Then, loop over each country, running CDR on that country’s data frame and saving the results to an element of cdr_results:

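# Name the preallocated slots so that assignment by name fills them
# instead of appending new elements
names(cdr_results) <- names(countries)
for (i in names(countries)) {
  cdr_results[[i]] <- CDR(countries[[i]])
}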

Now, cdr_results contains the results of running CDR on each of those data frames:

cdr_results[["KEN"]]

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955   0.0240
## 2 2005-2010   0.0104

cdr_results[["SWE"]]

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.00984
## 2 2005-2010  0.00997

cdr_results[["WLD"]]

## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.0193 
## 2 2005-2010  0.00817
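The preallocate-and-fill pattern from this section can be sketched on toy data as follows. The list inputs and the function total are hypothetical stand-ins for countries and CDR, with made-up values.

```r
# Toy inputs and a hypothetical summary function; values are made up.
inputs <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)
total <- function(x) sum(x)

# Preallocate one slot per input and give the slots the input names.
results <- vector("list", length = length(inputs))
names(results) <- names(inputs)

# Fill each named slot in turn.
for (i in names(inputs)) {
  results[[i]] <- total(inputs[[i]])
}

length(results)
results[["b"]]
```

Assigning by name into an unnamed list appends new elements rather than filling the preallocated ones, which is why the names are set before the loop in this sketch.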

Replacing For Loops with Map Functions

One annoyance of for loops, and a likely source of bugs, is that we need to define a vector to store the results. This requires creating the vector with the correct length and remembering to update that same vector inside the loop.

The map functions in the purrr package apply a function to each element of a vector.

The function map(.x, .f) applies the function .f to each element of the vector .x. The result is the list:

list(.f(.x[[1]]), .f(.x[[2]]), ..., .f(.x[[length(.x)]]))
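As a small check of this definition, the sketch below compares map() with the same list built by hand. The list inputs and the function total are hypothetical toy stand-ins with made-up values. Only map() itself comes from purrr.

```r
library(purrr)

# Toy inputs and a hypothetical function; values are made up.
inputs <- list(a = c(1, 2, 3), b = c(10, 20))
total <- function(x) sum(x)

# Apply total() to each element of inputs.
mapped <- map(inputs, total)

# The same result built by hand; map() keeps the names of its input.
by_hand <- list(a = total(inputs[[1]]), b = total(inputs[[2]]))

identical(mapped, by_hand)
```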

Using the map function we can replace the previous for loop with a single line of code:

map(countries, CDR)

## $KEN
## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955   0.0240
## 2 2005-2010   0.0104
## 
## $SWE
## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.00984
## 2 2005-2010  0.00997
## 
## $WLD
## # A tibble: 2 x 2
##   period    CDR_data
##   <chr>        <dbl>
## 1 1950-1955  0.0193 
## 2 2005-2010  0.00817

This applied the function CDR to each data frame in the countries list and returned another list with all the results. Unlike with the for loop, we did not need to create an empty list first.