How to avoid copying and pasting code by using functions, for loops and functionals
One of the most important principles in writing software is “don’t repeat yourself (DRY)”. This module covers using functions, for-loops, and functionals to avoid copying and pasting code.
In addition to this, see the R for Data Science chapters Functions and Iteration (for-loops and functionals) for more detailed discussions of these concepts.
Setup
Consider the example from QSS Exercise 1. Review the explanations and instructions for that question.
# a QSS Exercise from Chapter 1.
library("tidyverse")
## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.1 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library("qss")
data("Kenya", package = "qss")
data("Sweden", package = "qss")
data("World", package = "qss")
Replacing Copy and Pasting Code with Functions
One question asks you to calculate the crude death rate (CDR).
# Let's write some code that works for Kenya
Kenya %>%
group_by(period) %>%
summarize(CDR_data = sum(deaths) / sum(py.men + py.women))
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0240
## 2 2005-2010 0.0104
Awesome! This code does what I want.
But it only works for Kenya, and I’d need to copy and paste it to work on the Sweden and World data frames.
Sweden %>%
group_by(period) %>%
summarize(CDR_data = sum(deaths) / sum(py.men + py.women))
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.00984
## 2 2005-2010 0.00997
World %>%
group_by(period) %>%
summarize(CDR_data = sum(deaths) / sum(py.men + py.women))
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0193
## 2 2005-2010 0.00817
R4DS notes that I should write a function anytime I’m doing the same thing 3 or more times.
So to generalize the previous code, consider what changes when copying and pasting it. The only thing that changes is the input data frame: Kenya, Sweden, or World. I could rewrite this code by creating a new variable x, to which I assign the input data frame I want to use.
x <- Kenya
# x <- Sweden
# x <- World
x %>%
group_by(period) %>%
summarize(CDR_data = sum(deaths) / sum(py.men + py.women))
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0240
## 2 2005-2010 0.0104
This helps clarify the programming task, but isn’t yet that useful because we would still have to copy and paste that code to run it in a script.
However, we have now written the code as an input variable x and an expression that uses x, so we can take that code and encapsulate it in a function.
CDR <- function(x) {
x %>%
group_by(period) %>%
summarize(CDR_data = sum(deaths) / sum(py.men + py.women))
}
A function consists of three parts: name, arguments (inputs), and body.
- Functions are objects. You can assign them to variable names. The name is the name you assign the function to. In this case, CDR.
- The arguments are names that are given to input values.
- The body is the code that the function executes.
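To make the three parts concrete, here is a minimal sketch using a made-up function. The name rate and its arguments events and exposure are hypothetical and not part of the QSS exercise; formals() and body() are base R functions that return a function's arguments and body.

```r
# Hypothetical toy function: the name is `rate`, the arguments are
# `events` and `exposure`, and the body is the code between the braces.
rate <- function(events, exposure) {
  sum(events) / sum(exposure)
}

# Inspect the parts of the function object
arg_names <- names(formals(rate))  # names of the arguments
fn_body <- body(rate)              # the unevaluated body

# Call the function by name, supplying values for the arguments
result <- rate(events = c(2, 3), exposure = c(100, 150))
print(arg_names)
print(result)
```

Because a function is an ordinary object, rate could also be assigned to another name or passed to another function as an argument.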
Now we can run CDR on Kenya, Sweden, and World:
CDR(Kenya)
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0240
## 2 2005-2010 0.0104
CDR(Sweden)
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.00984
## 2 2005-2010 0.00997
CDR(World)
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0193
## 2 2005-2010 0.00817
Repeating Inputs with For Loops
In the previous section we still had to run the function CDR three times, once for each country. While three countries is not terrible, this would be quite tedious if we had hundreds of countries.
In this section, we’ll reduce that redundancy using a for loop.
Lists
The first step is to put those data frames in a list.
countries <- list(KEN = Kenya, SWE = Sweden, WLD = World)
The names of the elements of the list are arbitrary, but were purposefully chosen to be different than the names of the data frames to emphasize that they are arbitrary.
The list countries has 3 elements,
length(countries)
## [1] 3
with names
names(countries)
## [1] "KEN" "SWE" "WLD"
To get an element from the list, use [[ or $ and reference the elements by name or index number:
summary(countries$KEN)
## country period age births
## Length:30 Length:30 Length:30 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 0.0
## Mode :character Mode :character Mode :character Median : 0.0
## Mean : 298.3
## 3rd Qu.: 267.5
## Max. :2285.7
## deaths py.men py.women l_x
## Min. : 18.51 Min. : 39.85 Min. : 82.24 Min. : 9457
## 1st Qu.: 21.24 1st Qu.: 840.92 1st Qu.: 733.89 1st Qu.: 53537
## Median : 55.13 Median : 1712.56 Median : 1800.20 Median : 66138
## Mean : 91.06 Mean : 3681.89 Mean : 3684.72 Mean : 64688
## 3rd Qu.:103.75 3rd Qu.: 4506.66 3rd Qu.: 4417.88 3rd Qu.: 81411
## Max. :661.25 Max. :15932.59 Max. :15674.83 Max. :100000
summary(countries[["SWE"]])
## country period age births
## Length:30 Length:30 Length:30 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 36.48
## 3rd Qu.: 58.80
## Max. :193.19
## deaths py.men py.women l_x
## Min. : 0.191 Min. : 250.7 Min. : 320.6 Min. : 35709
## 1st Qu.: 1.522 1st Qu.:1217.2 1st Qu.:1230.0 1st Qu.: 93116
## Median : 4.157 Median :1383.4 Median :1358.1 Median : 96880
## Mean : 26.908 Mean :1350.0 Mean :1364.1 Mean : 91839
## 3rd Qu.: 14.406 3rd Qu.:1508.7 3rd Qu.:1499.5 3rd Qu.: 98904
## Max. :271.644 Max. :2612.9 Max. :2635.8 Max. :100000
summary(countries[[3]])
## country period age births
## Length:30 Length:30 Length:30 Min. : 0
## Class :character Class :character Class :character 1st Qu.: 0
## Mode :character Mode :character Mode :character Median : 0
## Mean : 38782
## 3rd Qu.: 55748
## Max. :219277
## deaths py.men py.women
## Min. : 3248 Min. : 30527 Min. : 47262
## 1st Qu.: 6263 1st Qu.: 380398 1st Qu.: 388215
## Median : 8104 Median : 673948 Median : 665470
## Mean : 17517 Mean : 778510 Mean : 770349
## 3rd Qu.: 14409 3rd Qu.:1184047 3rd Qu.:1153630
## Max. :101090 Max. :1619802 Max. :1512274
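The same accessors can be tried on a smaller object. The sketch below uses a made-up list named toy (hypothetical, not from the qss package) and also shows that single brackets, [, return a sub-list rather than the element itself.

```r
# A hypothetical list of two small data frames
toy <- list(A = data.frame(v = 1:2), B = data.frame(v = 3:5))

rows_by_dollar <- nrow(toy$A)      # access by name with $
rows_by_name <- nrow(toy[["B"]])   # access by name with [[
rows_by_index <- nrow(toy[[2]])    # access by position with [[

# Single brackets keep the list wrapper: the result is a list of length 1
sub_list <- toy["A"]

print(c(rows_by_dollar, rows_by_name, rows_by_index))
print(class(sub_list))
```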
The for loop lets us run the same code on each element of the list:
for (x in countries) {
print(glimpse(x))
print(CDR(x))
}
## Observations: 30
## Variables: 8
## $ country <chr> "KEN", "KEN", "KEN", "KEN", "KEN", "KEN", "KEN", "KEN...
## $ period <chr> "1950-1955", "1950-1955", "1950-1955", "1950-1955", "...
## $ age <chr> "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30...
## $ births <dbl> 0.000, 0.000, 0.000, 264.185, 485.564, 383.252, 268.6...
## $ deaths <dbl> 398.314, 36.799, 19.520, 18.508, 21.361, 20.424, 18.9...
## $ py.men <dbl> 2982.589, 1977.935, 1634.561, 1588.554, 1427.824, 120...
## $ py.women <dbl> 2977.828, 1969.698, 1633.975, 1564.652, 1364.061, 110...
## $ l_x <dbl> 100000.000, 75157.419, 71364.502, 69469.475, 67421.14...
## country period age births deaths py.men py.women l_x
## 1 KEN 1950-1955 0-4 0.000 398.314 2982.589 2977.828 100000.000
## 2 KEN 1950-1955 5-9 0.000 36.799 1977.935 1969.698 75157.419
## 3 KEN 1950-1955 10-14 0.000 19.520 1634.561 1633.975 71364.502
## 4 KEN 1950-1955 15-19 264.185 18.508 1588.554 1564.652 69469.475
## 5 KEN 1950-1955 20-24 485.564 21.361 1427.824 1364.061 67421.149
## 6 KEN 1950-1955 25-29 383.252 20.424 1204.917 1105.817 64854.800
## 7 KEN 1950-1955 30-34 268.644 18.979 1033.053 928.075 62126.807
## 8 KEN 1950-1955 35-39 165.693 18.854 913.425 802.620 59238.699
## 9 KEN 1950-1955 40-44 79.582 19.301 816.753 710.981 56114.347
## 10 KEN 1950-1955 45-49 25.573 19.889 692.612 654.844 52677.510
## 11 KEN 1950-1955 50-54 0.000 21.198 568.514 592.359 48980.402
## 12 KEN 1950-1955 55-59 0.000 23.384 455.258 501.808 44720.493
## 13 KEN 1950-1955 60-69 0.000 55.056 600.284 710.673 39640.720
## 14 KEN 1950-1955 70-79 0.000 53.304 231.866 337.111 25809.675
## 15 KEN 1950-1955 80+ 0.000 24.420 39.847 82.243 9457.307
## 16 KEN 2005-2010 0-4 0.000 661.251 15932.589 15674.827 100000.000
## 17 KEN 2005-2010 5-9 0.000 76.725 13254.048 13100.146 90959.093
## 18 KEN 2005-2010 10-14 0.000 66.137 11380.859 11277.374 89360.375
## 19 KEN 2005-2010 15-19 1063.629 62.441 10640.972 10575.915 88101.925
## 20 KEN 2005-2010 20-24 2285.725 75.375 9707.670 9692.037 86830.022
## 21 KEN 2005-2010 25-29 1868.313 105.366 8046.140 8020.328 85158.465
## 22 KEN 2005-2010 30-34 1119.287 132.677 6324.058 6188.021 82430.691
## 23 KEN 2005-2010 35-39 611.440 131.216 4794.929 4657.950 78352.835
## 24 KEN 2005-2010 40-44 208.039 98.897 3641.841 3697.673 72998.677
## 25 KEN 2005-2010 45-49 118.814 67.807 2892.613 3114.355 68493.362
## 26 KEN 2005-2010 50-54 0.000 55.204 2353.851 2596.142 64781.860
## 27 KEN 2005-2010 55-59 0.000 52.216 1790.568 1966.429 61285.591
## 28 KEN 2005-2010 60-69 0.000 111.775 2076.313 2325.052 57288.655
## 29 KEN 2005-2010 70-79 0.000 149.509 1122.979 1317.524 44378.060
## 30 KEN 2005-2010 80+ 0.000 115.845 329.133 401.195 23198.269
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0240
## 2 2005-2010 0.0104
## Observations: 30
## Variables: 8
## $ country <chr> "SWE", "SWE", "SWE", "SWE", "SWE", "SWE", "SWE", "SWE...
## $ period <chr> "1950-1955", "1950-1955", "1950-1955", "1950-1955", "...
## $ age <chr> "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30...
## $ births <dbl> 0.000, 0.000, 0.000, 40.823, 141.137, 160.882, 116.90...
## $ deaths <dbl> 13.765, 1.302, 1.216, 1.581, 2.264, 2.885, 3.610, 4.7...
## $ py.men <dbl> 1490.037, 1542.698, 1266.321, 1078.133, 1119.421, 130...
## $ py.women <dbl> 1410.502, 1470.816, 1217.133, 1049.193, 1105.129, 128...
## $ l_x <dbl> 100000.00, 97575.29, 97308.11, 97093.74, 96733.76, 96...
## country period age births deaths py.men py.women l_x
## 1 SWE 1950-1955 0-4 0.000 13.765 1490.037 1410.502 100000.00
## 2 SWE 1950-1955 5-9 0.000 1.302 1542.698 1470.816 97575.29
## 3 SWE 1950-1955 10-14 0.000 1.216 1266.321 1217.133 97308.11
## 4 SWE 1950-1955 15-19 40.823 1.581 1078.133 1049.193 97093.74
## 5 SWE 1950-1955 20-24 141.137 2.264 1119.421 1105.129 96733.76
## 6 SWE 1950-1955 25-29 160.882 2.885 1305.003 1284.552 96223.56
## 7 SWE 1950-1955 30-34 116.906 3.610 1367.220 1338.146 95692.18
## 8 SWE 1950-1955 35-39 64.795 4.704 1365.747 1333.127 95041.54
## 9 SWE 1950-1955 40-44 21.824 6.809 1366.917 1346.314 94214.33
## 10 SWE 1950-1955 45-49 1.702 10.015 1256.239 1268.418 93037.78
## 11 SWE 1950-1955 50-54 0.000 14.225 1101.377 1139.260 91196.10
## 12 SWE 1950-1955 55-59 0.000 19.862 944.579 1008.943 88360.78
## 13 SWE 1950-1955 60-69 0.000 65.883 1456.084 1620.309 83988.01
## 14 SWE 1950-1955 70-79 0.000 106.465 829.383 945.557 67496.43
## 15 SWE 1950-1955 80+ 0.000 95.869 250.678 320.593 35708.75
## 16 SWE 2005-2010 0-4 0.000 1.801 1361.938 1290.214 100000.00
## 17 SWE 2005-2010 5-9 0.000 0.191 1204.157 1142.830 99675.23
## 18 SWE 2005-2010 10-14 0.000 0.320 1445.904 1372.248 99631.15
## 19 SWE 2005-2010 15-19 9.000 0.832 1588.189 1507.308 99584.84
## 20 SWE 2005-2010 20-24 69.501 1.318 1435.878 1369.963 99440.57
## 21 SWE 2005-2010 25-29 155.772 1.354 1399.640 1340.452 99205.05
## 22 SWE 2005-2010 30-34 193.185 1.503 1511.593 1460.486 98966.03
## 23 SWE 2005-2010 35-39 98.987 2.155 1639.975 1581.454 98717.43
## 24 SWE 2005-2010 40-44 19.122 3.336 1637.465 1572.523 98386.34
## 25 SWE 2005-2010 45-49 0.907 5.310 1524.391 1476.251 97871.49
## 26 SWE 2005-2010 50-54 0.000 8.590 1450.930 1423.215 97027.02
## 27 SWE 2005-2010 55-59 0.000 14.466 1541.470 1529.924 95583.16
## 28 SWE 2005-2010 60-69 0.000 51.588 2612.883 2635.789 93349.47
## 29 SWE 2005-2010 70-79 0.000 92.385 1500.119 1794.682 84371.90
## 30 SWE 2005-2010 80+ 0.000 271.644 904.764 1567.216 63683.52
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.00984
## 2 2005-2010 0.00997
## Observations: 30
## Variables: 7
## $ country <chr> "WORLD", "WORLD", "WORLD", "WORLD", "WORLD", "WORLD",...
## $ period <chr> "1950-1955", "1950-1955", "1950-1955", "1950-1955", "...
## $ age <chr> "0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30...
## $ births <dbl> 0.000, 0.000, 0.000, 54238.336, 131939.226, 128063.23...
## $ deaths <dbl> 101090.393, 7960.861, 5550.176, 5831.195, 6649.425, 6...
## $ py.men <dbl> 946910.59, 726903.29, 666794.72, 626191.13, 573518.17...
## $ py.women <dbl> 904909.28, 694574.52, 635492.29, 600677.87, 555221.02...
## country period age births deaths py.men py.women
## 1 WORLD 1950-1955 0-4 0.000 101090.393 946910.59 904909.28
## 2 WORLD 1950-1955 5-9 0.000 7960.861 726903.29 694574.52
## 3 WORLD 1950-1955 10-14 0.000 5550.176 666794.72 635492.29
## 4 WORLD 1950-1955 15-19 54238.336 5831.195 626191.13 600677.87
## 5 WORLD 1950-1955 20-24 131939.226 6649.425 573518.17 555221.02
## 6 WORLD 1950-1955 25-29 128063.230 6425.219 508500.43 507276.96
## 7 WORLD 1950-1955 30-34 89246.610 6208.629 433338.38 437131.76
## 8 WORLD 1950-1955 35-39 56005.545 6877.151 400279.43 405527.72
## 9 WORLD 1950-1955 40-44 24326.598 7995.126 373771.34 382443.66
## 10 WORLD 1950-1955 45-49 5071.982 8885.744 326278.28 333888.55
## 11 WORLD 1950-1955 50-54 0.000 9690.833 273318.07 285689.90
## 12 WORLD 1950-1955 55-59 0.000 11109.256 219381.03 238443.39
## 13 WORLD 1950-1955 60-69 0.000 27701.495 302214.30 353255.95
## 14 WORLD 1950-1955 70-79 0.000 26702.179 133347.24 173890.98
## 15 WORLD 1950-1955 80+ 0.000 14341.537 30527.16 47261.68
## 16 WORLD 2005-2010 0-4 0.000 40098.376 1619801.88 1512273.96
## 17 WORLD 2005-2010 5-9 0.000 3757.199 1544406.24 1444845.83
## 18 WORLD 2005-2010 10-14 0.000 3247.765 1552687.63 1457101.18
## 19 WORLD 2005-2010 15-19 73194.689 4043.199 1593937.86 1509488.81
## 20 WORLD 2005-2010 20-24 219277.185 5399.345 1503387.77 1442885.43
## 21 WORLD 2005-2010 25-29 190318.699 5996.753 1337034.07 1294852.69
## 22 WORLD 2005-2010 30-34 114931.466 6519.377 1259434.33 1225101.75
## 23 WORLD 2005-2010 35-39 54975.769 7238.027 1210083.54 1177472.56
## 24 WORLD 2005-2010 40-44 17604.730 8212.596 1105937.47 1082103.13
## 25 WORLD 2005-2010 45-49 4278.729 9709.548 960561.39 948668.83
## 26 WORLD 2005-2010 50-54 0.000 11955.139 840307.73 837232.56
## 27 WORLD 2005-2010 55-59 0.000 14431.099 681101.65 696280.73
## 28 WORLD 2005-2010 60-69 0.000 38270.238 908660.24 982545.45
## 29 WORLD 2005-2010 70-79 0.000 54685.244 515933.20 636365.65
## 30 WORLD 2005-2010 80+ 0.000 58928.480 180743.64 307562.46
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0193
## 2 2005-2010 0.00817
The for loop runs three times:
- Set x = countries[[1]], and run CDR(x)
- Set x = countries[[2]], and run CDR(x)
- Set x = countries[[3]], and run CDR(x)
However, while we’ve run CDR on each country, we haven’t saved the results to use anywhere. We’ll make three changes to our code:
- Define an empty list cdr_results of the same length as countries to store the results.
- Loop over the names of the countries instead of the values so we can name the elements of the results.
- Within the loop, save the result to an element in cdr_results.
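To see why looping over names helps, compare the two styles on a made-up list. The object scores and the result vectors below are hypothetical; the vectors are grown inside the loop only to keep the sketch short.

```r
scores <- list(a = c(1, 2), b = c(3, 4, 5))

# Looping over values: inside the loop we only see the element,
# so the results cannot easily be labelled.
totals_by_value <- c()
for (x in scores) {
  totals_by_value <- c(totals_by_value, sum(x))
}

# Looping over names: the name is available both to look up the input
# and to label the output.
totals_by_name <- c()
for (nm in names(scores)) {
  totals_by_name[nm] <- sum(scores[[nm]])
}

print(totals_by_value)
print(totals_by_name)
```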
First, create an empty list the same length as countries.
cdr_results <- vector("list", length = length(countries))
cdr_results
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
Then, loop over each country, running CDR on that country’s data frame and saving the result to an element of cdr_results:
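# Name the elements first; assigning to a name that does not yet exist
# would append a new element instead of filling the preallocated slots.
names(cdr_results) <- names(countries)
for (i in names(countries)) {
  cdr_results[[i]] <- CDR(countries[[i]])
}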
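One detail is worth checking when a preallocated list is filled by name: if the list has no names yet, assigning to a name appends a new element and leaves the NULL placeholders in place. The sketch below demonstrates this with made-up inputs; inputs, unnamed, and named are hypothetical objects.

```r
inputs <- list(a = 1:3, b = 4:6)

# Preallocated but unnamed: assigning by name appends new elements
unnamed <- vector("list", length = length(inputs))
for (i in names(inputs)) {
  unnamed[[i]] <- sum(inputs[[i]])
}

# Preallocated and named: assigning by name fills the existing slots
named <- vector("list", length = length(inputs))
names(named) <- names(inputs)
for (i in names(inputs)) {
  named[[i]] <- sum(inputs[[i]])
}

print(length(unnamed))  # placeholders plus appended results
print(length(named))    # same length as inputs
```

Looking up a result by name works in both cases, but only the named version has the same length as the input list.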
Now, cdr_results contains the results of running CDR on each of those data frames:
cdr_results[["KEN"]]
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0240
## 2 2005-2010 0.0104
cdr_results[["SWE"]]
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.00984
## 2 2005-2010 0.00997
cdr_results[["WLD"]]
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0193
## 2 2005-2010 0.00817
Replacing for loops with map functions
One annoyance with for loops, and a common source of bugs, is that we need to define a vector to store the results. This requires creating the vector with the correct length and remembering to assign each result to it.
The map functions in the purrr package apply a function to each element of a vector.
The function map(.x, .f) applies the function .f to each element of the vector .x. The result is
list(.f(.x[[1]]), .f(.x[[2]]), ..., .f(.x[[length(.x)]]))
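This equivalence can be checked on a small example. The sketch below assumes the purrr package is installed; the list xs is made up for illustration.

```r
library(purrr)

xs <- list(a = 1:3, b = 4:6)

# Apply sum() to each element with map()
via_map <- map(xs, sum)

# The same result written out element by element
by_hand <- list(a = sum(xs[[1]]), b = sum(xs[[2]]))

print(identical(via_map, by_hand))
```

Note that map() keeps the names of the input, so the output elements are labelled a and b without any extra work.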
Using the map function we can replace the previous for loop with a single line of code:
map(countries, CDR)
## $KEN
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0240
## 2 2005-2010 0.0104
##
## $SWE
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.00984
## 2 2005-2010 0.00997
##
## $WLD
## # A tibble: 2 x 2
## period CDR_data
## <chr> <dbl>
## 1 1950-1955 0.0193
## 2 2005-2010 0.00817
This applied the function CDR to each data frame in the countries list and returned another list with all the results. Unlike using a for loop, we did not need to first create an empty vector.
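If a single data frame is more convenient than a list of results, one option is to stack the elements with dplyr's bind_rows(), using its .id argument to record which list element each row came from. The sketch below is only an illustration: it assumes dplyr and purrr are installed, and toy_countries holds made-up numbers standing in for the qss data frames.

```r
library(dplyr)
library(purrr)

# Made-up stand-ins with the columns that CDR() uses
toy_countries <- list(
  AAA = tibble(period = c("p1", "p1", "p2"),
               deaths = c(1, 3, 2),
               py.men = c(50, 50, 40),
               py.women = c(50, 50, 60)),
  BBB = tibble(period = c("p1", "p2"),
               deaths = c(5, 6),
               py.men = c(250, 100),
               py.women = c(250, 200))
)

# Same function as defined earlier
CDR <- function(x) {
  x %>%
    group_by(period) %>%
    summarize(CDR_data = sum(deaths) / sum(py.men + py.women))
}

# Apply CDR to each element, then stack the results into one data frame;
# the list names are stored in a new column called "country".
cdr_all <- bind_rows(map(toy_countries, CDR), .id = "country")
print(cdr_all)
```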