Notes on R for Data Science

Visualization

Other resources:

Tibbles

Why might you want to create non-syntactic variable names? Variable names are often used directly in plots (e.g. as axis titles) or as headers in tables, where spaces or other characters that are invalid in R variable names can be useful. Plotting and table functions usually have ways to supply text other than the column name, but using a non-syntactic variable name directly can be convenient.
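For example (a minimal sketch with a made-up column name), a non-syntactic name just needs backticks:

tb <- tibble::tibble(`Average Score` = c(90, 85))
tb$`Average Score`
## [1] 90 85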

If nycflights13::flights were printed in the console as a plain data frame it would be much worse. Just try it, I dare you.

as.data.frame(nycflights13::flights)

Partial matching is bad! Suppose you have this data frame:

d <- data.frame(foobar = 1:3)

This code works to extract foobar,

d$foobar
## [1] 1 2 3

but so does this,

d$foo
## [1] 1 2 3

This is called partial matching, since foo matches foobar. R uses partial matching with $ and argument names in functions.

This seems convenient. Who wants to type extra characters? Now suppose you have a data frame with two variables foobar and foo. However, when creating it you accidentally misspell foo as fo.

d <- data.frame(foobar = 1:3, fo = 5:7)

If you run this,

d$foo
## [1] 1 2 3

it will still return a value, and you may never notice your initial error. It would have been better if $ had returned NULL or raised an error, since then you would have caught the bug and fixed it. Tibbles, however, never partially match column names with $:

library("tibble")
as_tibble(d)$foo
## Warning: Unknown or uninitialised column: 'foo'.
## NULL

Additionally, [[ does not partially match:

d[["foo"]]
## NULL

Partial matching is bad; never do it. Tibbles prevent you from doing it.

Interacting with Older Code

Not all older functions work with tibbles (the Amelia package is one example). This is because they may rely on the very quirks of data.frame behavior that tibbles fix; most often, the problem is that [ on a data frame inconsistently returns either a vector or a data frame, while [ on a tibble always returns a tibble. Use as.data.frame() to turn a tibble back into a data frame before passing it to such functions.
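A minimal sketch of the [ inconsistency and the fix (column names here are made up):

df <- data.frame(x = 1:3, y = 4:6)
df[, "x"]                  # base data.frame drops to a vector
## [1] 1 2 3
tb <- tibble::as_tibble(df)
tb[, "x"]                  # a tibble always returns a tibble (here 3 x 1)
as.data.frame(tb)[, "x"]   # convert back before passing to older functions
## [1] 1 2 3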

Tidy Data

The Rules

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell

or, even more simply:

  1. Put each dataset in a tibble
  2. Put each variable in a column

These rules seem obvious at first, so it helps to see examples of data that do not follow them and the problems that result.
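For instance, here is a small made-up table where one variable (year) is spread across the column headers, and how tidyr reshapes it into tidy form:

library("tidyr")
untidy <- tibble::tribble(
  ~country, ~`1999`, ~`2000`,
  "A",           10,      20,
  "B",           30,      40
)
# make it tidy: one row per country-year
pivot_longer(untidy, cols = c(`1999`, `2000`),
             names_to = "year", values_to = "cases")
# yields a 4 x 3 tibble with columns country, year, cases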

Some nuances:

The definitions of variable, observation, and value are not always clear, and how you store and arrange the data can depend on how you aim to use it. Generally, aim to store the data in a tidy format that minimizes errors; you can transform it later when you model it. See the non-tidy data section below.

It is easier to work with variables in columns because of mutate() and the summary functions. It also works better with other tidyverse functions: e.g. using group_by() and summarise() to group and summarize, or facet_* and aesthetics in ggplot2.
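For instance (a sketch with made-up data), once each variable is its own column, grouped summaries become a single pipeline:

library("dplyr")
tidy <- tibble::tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  cases   = c(10, 20, 30, 40)
)
tidy %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases))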

The tidy data ideas are adapted from database normalization, but simplified and adapted to the typical uses of practicing data scientists.

Non-tidy data

Replication datasets and datasets used in analysis are often non-tidy. Why?

Relational Data

nycflights13 is an example of a data-only R package. R packages can contain both functions and data; since datasets can get large, they are often packaged on their own. Data-only R packages make it convenient for R users to access your data, but they should not be the only way you provide your research data. Not everyone uses R, so the original data should also be provided in a program-agnostic format (e.g. CSV files). The same holds for those using Stata: they should not distribute data only in Stata-specific .dta files (even if, as we saw earlier, other programs can read that data). Another example of a data-only R package is the gapminder package.
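For example, the tables in nycflights13 load like any other object once the package is attached, and can be re-exported to a program-agnostic format:

library("nycflights13")
dim(flights)
## [1] 336776     19
# write a CSV copy for users of other software
write.csv(airlines, "airlines.csv", row.names = FALSE)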

Strings

The discussion of encodings is detailed, but those details can make your life hell. Skim it now, but be aware that what should be simple actually is not.
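A minimal sketch of the kind of issue to watch for: the same bytes mean different characters in different encodings. (The string below is "façade" encoded in Latin-1.)

x <- "fa\xe7ade"
Encoding(x) <- "latin1"            # declare what the bytes actually are
iconv(x, from = "latin1", to = "UTF-8")
## [1] "façade"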

After reading this you should be able to answer:

You don’t need to memorize the various types of character encodings or regular-expression syntax. For the former, the key skill is knowing that encodings exist and being able to recognize when you are having an encoding issue; after that, Google is your friend. For the latter, you should be aware that regular expressions are a way to match string patterns; when you need to match a pattern, look up the syntax. You will learn regular expressions over time.
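As a small taste (made-up strings), stringr exposes regular expressions through simple functions:

library("stringr")
x <- c("apple", "banana", "pear")
str_detect(x, "an")              # which elements match the pattern?
## [1] FALSE  TRUE FALSE
str_replace(x, "[aeiou]", "-")   # replace the first vowel
## [1] "-pple"  "b-nana" "p-ar"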

The rex package provides functions that allow you to write regular expressions using R functions rather than the more compact, but sometimes difficult to interpret, regex syntax.

Warning: There are several variants of regular expressions out there. All use a similar grammar, but each language or library may implement slightly different features.

This Computerphile video on Unicode is great: Characters, Symbols and the Unicode Miracle.

This suggested reading is very useful: http://kunststube.net/encoding/

Also see:

Dates and Times

This section seems less complete than the others. Refer to the lubridate vignette for more information.
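A small sketch of what lubridate handles, using a made-up date:

library("lubridate")
d <- ymd("2017-01-31")
d + months(1)          # Feb 31 does not exist
## [1] NA
d %m+% months(1)       # %m+% rolls back to the last valid day
## [1] "2017-02-28"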

Vectors

  • What is the difference between atomic vectors and lists?
  • What is the difference between numeric vectors and integer vectors?
  • List several problems that may occur with floating point arithmetic
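A quick sketch of the first two distinctions:

typeof(1)                 # numeric literals are doubles by default
## [1] "double"
typeof(1L)                # the L suffix makes an integer
## [1] "integer"
typeof(c(1, "a"))         # atomic vectors hold one type, so R coerces
## [1] "character"
typeof(list(1, "a"))      # lists can mix types
## [1] "list"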

Why does floating point math matter? For 99% of the work you are likely to do as an applied social scientist, you do not need to know the details: someone else has written the numerical methods and (hopefully) accounted for numerical issues. However, if you are not even aware that “floating point numbers” are a “thing”, then when something goes wrong it will seem like magic. Being aware of these problems will also help you understand error messages from optimization routines that complain about “numerical precision”.

Computerphile has a good video on Floating Point Numbers.

Be careful testing equality with floating point numbers:

0.1 + 0.2 == 0.3
## [1] FALSE
0.15 + 0.15 >= 0.1 + 0.2
## [1] FALSE

To test for equality with floating point numbers use all.equal:

all.equal(0.15 + 0.15, 0.1 + 0.2)
## [1] TRUE

Functions that test equality of floating point numbers will have some sort of “tolerance” argument, which is the maximum difference at which two numbers are still treated as equal.
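For example, dplyr::near() compares with a built-in tolerance:

dplyr::near(0.1 + 0.2, 0.3)
## [1] TRUE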

Numbers too large to represent will “overflow” to Inf:

1e1000000
## [1] Inf

This is true even if the final expression would be finite. E.g. \(\log(e^{10000}) = 10000\), but in R,

log(exp(10000))
## [1] Inf

because \(e^{10000}\) overflows to Inf, and \(\log(\infty) = \infty\).
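The usual workaround is to stay on the log scale. For example, density functions underflow to 0 far in the tails (the mirror image of overflow), but their log argument avoids it:

log(dnorm(50))            # dnorm(50) underflows to 0, so the log is -Inf
## [1] -Inf
dnorm(50, log = TRUE)     # computed directly on the log scale
## [1] -1250.919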

Models

Some of the discussion of models is slightly different, and has a different emphasis, than in most social science research. This is largely because the book is speaking to data scientists, for whom the primary goal is prediction rather than theory testing (that I don’t view these as so different is a story for another time).

The discussion about hypothesis generation vs. confirmation is interesting. Too little emphasis is placed on hypothesis generation in social science, and the importance of out-of-sample testing also receives too little attention in political science.

And from this discussion it should be clear that many papers in social science are hypothesis generation masquerading as hypothesis confirmation.

Model Basics

The distinction between a family of models and a fitted model is a useful way to think about modeling, especially since some families of models can themselves be viewed as fitted models of a more flexible family. For example, linear regression is a special case of GLMs, or of Gaussian processes, etc.

NOTE: It’s worth mentioning these more general models, though they don’t appear much in social science work. I should try to explain why; I can think of several reasons.

For more complex models, visualize their

  1. predictions
  2. residuals
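A minimal sketch using modelr with the mpg data from ggplot2 (the model here is just an illustration):

library("modelr")
library("dplyr")
library("ggplot2")
mod <- lm(hwy ~ displ, data = mpg)
mpg2 <- mpg %>%
  add_predictions(mod) %>%   # adds a pred column
  add_residuals(mod)         # adds a resid column
ggplot(mpg2, aes(displ, pred)) + geom_line()    # 1. predictions
ggplot(mpg2, aes(displ, resid)) + geom_point()  # 2. residuals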