Notes on R for Data Science
Tibbles
Why might you want to create non-syntactic variable names? Variable names are often used in plots (e.g. axis titles) or as headers in tables, where spaces or other characters that are invalid in R variable names can be useful. Plotting and table functions usually have ways to supply display text other than the column name, but using a non-syntactic variable name directly can be convenient.
If nycflights13::flights were printed in the console as a plain data frame, it would be much worse. Just try it, I dare you:
as.data.frame(nycflights13::flights)
Partial matching is bad! Suppose you have this data frame:
d <- data.frame(foobar = 1:3)
This code works to extract foobar,
d$foobar
## [1] 1 2 3
but so does this,
d$foo
## [1] 1 2 3
This is called partial matching, since foo matches foobar. R uses partial matching with $ and with argument names in functions.
This seems convenient. Who wants to type extra characters? Now suppose you have a data frame with two variables, foobar and foo. However, when creating it you accidentally misspell foo as fo.
d <- data.frame(foobar = 1:3, fo = 5:7)
If you run this,
d$foo
## [1] 1 2 3
it will still return a value, and you may not realize your initial error. It would have been better if $ had returned NULL or raised an error, since then you would have noticed the bug and fixed it. However, tibble objects never partially match column names with $:
library("tibble")
as_tibble(d)$foo
## Warning: Unknown or uninitialised column: 'foo'.
## NULL
Additionally, [[ does not partially match:
d[["foo"]]
## NULL
Partial matching is bad; never do it. Tibbles prevent you from doing it.
Interacting with Older Code
Not all older functions work with tibbles (the Amelia package is one example). This is because they may rely on the very quirks in data.frame behavior that tibbles fix. Use as.data.frame() to turn a tibble back into a data.frame. These problems are usually due to [ inconsistently returning either a vector or a data frame.
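A minimal sketch of that inconsistency (the data frame and column names here are invented for illustration):
df <- data.frame(x = 1:3, y = 4:6)
df[, c("x", "y")]  # selecting two columns returns a data frame
df[, "x"]          # selecting one column drops to a vector
## [1] 1 2 3
tb <- tibble::as_tibble(df)
tb[, "x"]          # a tibble always returns a tibble, never a vector
as.data.frame(tb)[, "x"]  # back-convert for older code expecting a vector
## [1] 1 2 3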
Tidy Data
- Describe what tidy data is and why it is important.
- Provide some examples of non-tidy data
- Be able to convert non-tidy data to tidy data
- Provide some examples when non-tidy data formats would be appropriate or useful
The Rules
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell
or even
- Put each dataset in a tibble
- Put each variable in a column
These rules seem obvious at first, so it helps to see examples of data that do not follow them and the problems that result.
Some nuances:
The definitions of variable, observation, and value are not always clear, and how you store and arrange the data can depend on how you aim to use it. Generally, aim to store the data in a tidy format that minimizes errors; you can transform it later when you model it. See the discussion of non-tidy data below.
It is easier to work with variables in columns because of mutate and summary functions. Variables in columns also work better with other tidyverse functions: e.g. using group_by to group and summarize, or facet_* and aesthetics in ggplot2.
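A minimal sketch of converting non-tidy data to tidy data with tidyr::pivot_longer (the table is invented for illustration; each year's values are spread across columns):
library(tidyr)

untidy <- tibble::tibble(
  country = c("A", "B"),
  `1999`  = c(100, 200),
  `2000`  = c(110, 210)
)

# Gather the year columns into name-value pairs:
# each row of the result is one country-year observation
pivot_longer(untidy, cols = c(`1999`, `2000`),
             names_to = "year", values_to = "cases")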
The tidy data ideas are adapted from database normalization, but simplified and adapted to the typical uses of practicing data scientists.
Non-tidy data
- Corpus and text data are often stored in sparse matrices: https://cran.r-project.org/web/packages/tm/tm.pdf
- Graph (network) data has its own formats: http://igraph.org/r/doc/
Replication datasets and datasets used in analysis are often non-tidy. Why?
Dates and Times
- Ideas for applications: CDB90 data; COW war start and end dates and durations
- Read more on time-zones: https://en.wikipedia.org/wiki/Time_zone
- The Problem with Time & Timezones - Computerphile (video)
- The history of the tz database is itself interesting: https://en.wikipedia.org/wiki/Tz_database
- A literary appreciation of the Olson/Zoneinfo/tz database
Relational Data
- What is relational data?
- What is a primary key?
- List and describe the types of joins (a small sketch follows at the end of this section)
- What are set union, intersection, complement, and difference?
nycflights13 is an example of a data-only R package. R packages can contain both functions and data. Since datasets can get large, they are often distributed as their own package. Data-only R packages make it convenient for R users to access your data, but they should not be the only way you provide your research data. Not everyone uses R, so the original data should also be provided in a program-agnostic format (e.g. CSV files). The same holds for Stata users: they should not distribute data only in Stata-specific .dta files (even if, as we saw earlier, other programs can read that data). Another example of a data-only R package is the gapminder package.
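A small sketch of joins on relational data, using the nycflights13 tables (carrier is a primary key in airlines and a foreign key in flights; these are the standard dplyr join functions):
library(dplyr)
library(nycflights13)

# Attach the full airline name to each flight by matching on carrier
flights %>%
  select(year, month, day, carrier) %>%
  left_join(airlines, by = "carrier")
# inner_join(), right_join(), full_join(), semi_join(), and anti_join()
# follow the same pattern with different matching and retention rules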
Strings
The discussion of encodings is detailed, but these details can make your life hell. Skim it now, but be aware that what should be simple actually is not.
After reading this you should be able to answer:
- What are character encodings? Why might they be an issue in data cleaning?
- What are regular expressions? When would you use them?
You don’t need to memorize the various character encodings or the details of regular expression syntax. For the former, knowing that encodings exist and being able to recognize when you are having a character encoding issue is the key skill; after that, Google is your friend. For the latter, you should be aware that regular expressions are a way to match string patterns. When you need to match a string pattern, look up the relevant regular expression syntax. You will learn regular expressions over time.
The rex package provides functions that allow you to write regular expressions with R functions rather than the more compact, but sometimes difficult to interpret, regex syntax.
Warning: there are several variants of regular expressions. All use a similar grammar, but each language or library may implement slightly different features.
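A small sketch of matching string patterns with the stringr package (the example strings are made up):
library(stringr)
x <- c("apple 2019", "banana", "cherry 2021")
str_detect(x, "\\d{4}")   # which strings contain a four-digit number?
## [1]  TRUE FALSE  TRUE
str_extract(x, "\\d{4}")  # extract the first match (NA if none)
## [1] "2019" NA     "2021"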
This Computerphile video on Unicode is great: Characters, Symbols and the Unicode Miracle - Computerphile.
This suggested reading is very useful: http://kunststube.net/encoding/
Also see:
- Character encoding Stat 545. Jenny Bryan.
- Character data. Stat 545. Jenny Bryan.
- Regular expressions in R. Stat 545. Jenny Bryan.
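As a minimal sketch of what an encoding problem looks like in R (the string contains the Latin-1 byte for "ç", invented here for illustration):
x <- "fa\xe7ile"        # Latin-1 bytes for "façile" read into a UTF-8 session
x                       # R cannot interpret the byte, so it shows an escape
## [1] "fa\xe7ile"
Encoding(x) <- "latin1" # declare the actual encoding
x
## [1] "façile"
iconv(x, from = "latin1", to = "UTF-8")  # or convert the bytes outright
## [1] "façile"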
Dates and Times
This section seems less complete than the others. Refer to the lubridate vignette for more information.
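A minimal sketch of lubridate's parsing and time-zone handling (the timestamps are invented for illustration):
library(lubridate)
d <- ymd_hms("2017-01-01 06:30:00", tz = "America/New_York")
d
## [1] "2017-01-01 06:30:00 EST"
with_tz(d, "UTC")   # the same instant, displayed in another time zone
## [1] "2017-01-01 11:30:00 UTC"
force_tz(d, "UTC")  # the same clock time, reinterpreted in another zone
## [1] "2017-01-01 06:30:00 UTC"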
Vectors
- What is the difference between atomic vectors and lists?
- What is the difference between numeric vectors and integer vectors?
- List several problems that may occur with floating point arithmetic
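A quick sketch of the first two distinctions in the list above:
typeof(1L)            # integer vector (note the L suffix)
## [1] "integer"
typeof(1)             # numeric (double) vector
## [1] "double"
typeof(c(1, "a"))     # atomic vectors hold one type, so elements are coerced
## [1] "character"
typeof(list(1, "a"))  # lists can hold elements of different types
## [1] "list"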
Why does floating point math matter? For 99% of the work you are likely to do as an applied social scientist, you will not need to worry about it: someone else has written the numerical methods and (hopefully) accounted for numerical issues. However, if you are not even aware that "floating point numbers" are a "thing", then when something goes wrong it will seem like magic. Also, being aware of these problems will help you understand error messages from optimization routines that complain about "numerical precision".
Computerphile has a good video on Floating Point Numbers.
Be careful testing equality with floating point numbers:
0.1 + 0.2 == 0.3
## [1] FALSE
0.15 + 0.15 >= 0.1 + 0.2
## [1] FALSE
To test for equality with floating point numbers, use all.equal:
all.equal(0.15 + 0.15, 0.1 + 0.2)
## [1] TRUE
Functions that test equality of floating point numbers have some sort of "tolerance" argument, which is the maximum difference at which two numbers are still considered equal.
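For example, all.equal's tolerance argument, or dplyr::near, which compares with a built-in tolerance (a small sketch):
all.equal(1.0, 1.0001, tolerance = 1e-3)  # differences below the tolerance count as equal
## [1] TRUE
dplyr::near(0.1 + 0.2, 0.3)               # vectorized comparison with a default tolerance
## [1] TRUE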
Numbers too large to represent will "overflow" to Inf:
1e1000000
## [1] Inf
This is true even if the final expression would be finite. E.g. \(\log(e^{10000}) = 10000\), but in R,
log(exp(10000))
## [1] Inf
because \(e^{10000}\) overflows to Inf, and \(\log(\infty) = \infty\).
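The overflow threshold is recorded in .Machine; a quick sketch:
.Machine$double.xmax       # the largest representable finite double
## [1] 1.797693e+308
2 * .Machine$double.xmax   # anything beyond it overflows
## [1] Inf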
Models
Some of the discussion of models is slightly different, and has a different emphasis than in most social science research. This is largely because this book is speaking to data scientists, where the primary goal is prediction rather than theory testing (that I don’t view these as too different is a different story).
The discussion about hypothesis generation vs. confirmation is interesting. Too little emphasis is placed on hypothesis generation in social science. The importance of out-of-sample testing also receives too little emphasis in political science.
And from this discussion it should be clear that many papers in social science are hypothesis generation masquerading as hypothesis confirmation.
Model Basics
The distinction between a family of models and a fitted model is a useful way to think about models, especially since some families of models can themselves be viewed as fitted models of a more flexible family. For example, linear regression is a special case of GLMs, Gaussian processes, etc.
NOTE It’s worth mentioning these as more general models, though they don’t appear as much in social science work. I should try to explain that. I can think of several reasons:
- a preference for easy-to-explain models (though I think that’s wrong: most people can’t visualize high-dimensional space well, and they interpret results marginally even though the results are conditional)
- status-quo bias and path dependence, combined with a lack of knowledge of work outside the field and a median lack of technical ability to understand or use these models
- the most principled reason is that those more complicated models really excel at prediction. If we take an agnostic approach to regression, as in the Angrist and Pischke books, then regression is not being used to fit \(f(y | x)\); it is being used to fit \(E(y | x)\), and more specifically to get some sort of average effect for a change in a specific variable.
Model basics
For more complex models, visualize their
- predictions
- residuals
- see the tidyr::complete, tidyr::expand, and modelr::data_grid functions for generating grids of values at which to evaluate a model
- the modelr::add_residuals and modelr::add_predictions functions add residuals or predictions to the original data
- modelr::geom_ref_line adds a reference line to residual plots (a short worked sketch follows this list)
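A minimal sketch of that workflow, using the sim1 example data that ships with modelr:
library(modelr)
library(ggplot2)

mod <- lm(y ~ x, data = sim1)           # fit a simple linear model

grid <- data_grid(sim1, x)              # evenly spaced grid of x values
grid <- add_predictions(grid, mod)      # adds a pred column to the grid

sim1_resid <- add_residuals(sim1, mod)  # adds a resid column to the data

# Plot residuals against x, with a reference line at zero
ggplot(sim1_resid, aes(x, resid)) +
  geom_ref_line(h = 0) +
  geom_point()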