Notes on R for Data Science
Tibbles
Why might you want to create non-syntactic variable names? Variable names are often used in plots (e.g. axis titles) or as headers in tables, where spaces or other characters that are invalid in R variable names can be useful. Plotting and table functions usually have ways to supply display text other than the column name, but using a non-syntactic variable name directly can be convenient.
If nycflights13::flights were printed in the console as a plain data frame, it would be much worse. Just try it, I dare you:
as.data.frame(nycflights13::flights)
Partial matching is bad! Suppose you have this data frame:
d <- data.frame(foobar = 1:3)
This code works to extract foobar,
d$foobar
## [1] 1 2 3
but so does this,
d$foo
## [1] 1 2 3
This is called partial matching, since foo matches foobar. R uses partial matching with $ and with argument names in functions.
This seems convenient. Who wants to type extra characters? Now suppose you have a data frame with two variables, foobar and foo. However, when creating it you accidentally misspell foo as fo.
d <- data.frame(foobar = 1:3, fo = 5:7)
If you run this,
d$foo
## [1] 1 2 3
it will still return a value, and you may not realize your initial error. It would have been better if $ had returned NULL or raised an error, since then you would have noticed the bug and fixed it. However, tibble objects never partially match column names with $:
library("tibble")
as_tibble(d)$foo
## Warning: Unknown or uninitialised column: 'foo'.
## NULL
Additionally, [[ does not partially match:
d[["foo"]]
## NULL
Partial matching is bad; never do it. Tibbles prevent you from doing it.
Interacting with Older Code
Not all older functions work with tibbles (the Amelia package is one example). This is because they may rely on the very quirks in data.frame behavior that tibbles fix. Use as.data.frame() to turn a tibble back into a data.frame. These problems are usually due to [ inconsistently returning either a vector or a data frame.
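A minimal sketch of that inconsistency (the data frame and column names here are invented for illustration):
df <- data.frame(x = 1:3, y = 4:6)
df[, c("x", "y")]  # selecting two columns returns a data frame
df[, "x"]          # selecting one column drops to a vector
## [1] 1 2 3
tb <- tibble::as_tibble(df)
tb[, "x"]          # a tibble always returns a tibble, never a vector
as.data.frame(tb)[, "x"]  # back-convert for older code expecting a vector
## [1] 1 2 3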
Tidy Data
- Describe what tidy data is and why it is important.
- Provide some examples of non-tidy data
- Be able to convert non-tidy data to tidy data
- Provide some examples when non-tidy data formats would be appropriate or useful
The Rules
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell
or even
- Put each dataset in a tibble
- Put each variable in a column
These rules seem obvious at first, so it helps to see examples of data that do not follow them and the problems that result.
Some nuances:
The definitions of variable, observation, and value are not always clear, and how you store and arrange the data can depend on how you aim to use it. Generally, aim to store the data in a tidy format that minimizes errors; you can transform it later when you model it. See the discussion of non-tidy data below.
It is easier to work with variables in columns because of mutate and summary functions. Variables in columns also work better with other tidyverse functions: e.g. using group_by to group and summarize, or facet_* and aesthetics in ggplot2.
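A minimal sketch of converting non-tidy data to tidy data with tidyr::pivot_longer (the table is invented for illustration; each year's values are spread across columns):
library(tidyr)

untidy <- tibble::tibble(
  country = c("A", "B"),
  `1999`  = c(100, 200),
  `2000`  = c(110, 210)
)

# Gather the year columns into name-value pairs:
# each row of the result is one country-year observation
pivot_longer(untidy, cols = c(`1999`, `2000`),
             names_to = "year", values_to = "cases")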
The tidy data ideas are adapted from database normalization, but simplified and adapted to the typical uses of practicing data scientists.
Non-tidy data
- Corpus and text data are often stored in sparse matrices: https://cran.r-project.org/web/packages/tm/tm.pdf
- Graph (network) data has its own formats: http://igraph.org/r/doc/
Replication datasets and datasets used in analysis are often non-tidy. Why?
Dates and Times
- Ideas for applications: CDB90 data; COW war start and end dates and durations
- Read more on time-zones: https://en.wikipedia.org/wiki/Time_zone
- The Problem with Time & Timezones - Computerphile (video)
- The history of the tz database is itself interesting: https://en.wikipedia.org/wiki/Tz_database
- A literary appreciation of the Olson/Zoneinfo/tz database
Relational Data
- What is relational data?
- What is a primary key?
- List and describe the types of joins (a small sketch follows at the end of this section)
- What are set union, intersection, complement, and difference?
nycflights13 is an example of a data-only R package. R packages can contain both functions and data. Since datasets can get large, they are often distributed as their own package. Data-only R packages make it convenient for R users to access your data, but they should not be the only way you provide your research data. Not everyone uses R, so the original data should also be provided in a program-agnostic format (e.g. CSV files). The same holds for Stata users: they should not distribute data only in Stata-specific .dta files (even if, as we saw earlier, other programs can read that data). Another example of a data-only R package is the gapminder package.
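A small sketch of joins on relational data, using the nycflights13 tables (carrier is a primary key in airlines and a foreign key in flights; these are the standard dplyr join functions):
library(dplyr)
library(nycflights13)

# Attach the full airline name to each flight by matching on carrier
flights %>%
  select(year, month, day, carrier) %>%
  left_join(airlines, by = "carrier")
# inner_join(), right_join(), full_join(), semi_join(), and anti_join()
# follow the same pattern with different matching and retention rules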
Strings
The discussion of encodings is detailed, but these details can make your life hell. Skim it now, but be aware that what should be simple actually is not.
After reading this you should be able to answer:
- What are character encodings? Why might they be an issue in data cleaning?
- What are regular expressions? When would you use them?
You don’t need to memorize the various character encodings or the details of regular expression syntax. For the former, knowing that encodings exist and being able to recognize when you are having a character encoding issue is the key skill; after that, Google is your friend. For the latter, you should be aware that regular expressions are a way to match string patterns. When you need to match a string pattern, look up the relevant regular expression syntax. You will learn regular expressions over time.
The rex package provides functions that allow you to write regular expressions with R functions rather than the more compact, but sometimes difficult to interpret, regex syntax.
Warning: there are several variants of regular expressions. All use a similar grammar, but each language or library may implement slightly different features.
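A small sketch of matching string patterns with the stringr package (the example strings are made up):
library(stringr)
x <- c("apple 2019", "banana", "cherry 2021")
str_detect(x, "\\d{4}")   # which strings contain a four-digit number?
## [1]  TRUE FALSE  TRUE
str_extract(x, "\\d{4}")  # extract the first match (NA if none)
## [1] "2019" NA     "2021"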
This Computerphile video on Unicode is great: Characters, Symbols and the Unicode Miracle - Computerphile.
This suggested reading is very useful: http://kunststube.net/encoding/
Also see:
- Character encoding Stat 545. Jenny Bryan.
- Character data. Stat 545. Jenny Bryan.
- Regular expressions in R. Stat 545. Jenny Bryan.
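As a minimal sketch of what an encoding problem looks like in R (the string contains the Latin-1 byte for "ç", invented here for illustration):
x <- "fa\xe7ile"        # Latin-1 bytes for "façile" read into a UTF-8 session
x                       # R cannot interpret the byte, so it shows an escape
## [1] "fa\xe7ile"
Encoding(x) <- "latin1" # declare the actual encoding
x
## [1] "façile"
iconv(x, from = "latin1", to = "UTF-8")  # or convert the bytes outright
## [1] "façile"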
Dates and Times
This section seems less complete than the others. Refer to the lubridate vignette for more information.
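A minimal sketch of lubridate's parsing and time-zone handling (the timestamps are invented for illustration):
library(lubridate)
d <- ymd_hms("2017-01-01 06:30:00", tz = "America/New_York")
d
## [1] "2017-01-01 06:30:00 EST"
with_tz(d, "UTC")   # the same instant, displayed in another time zone
## [1] "2017-01-01 11:30:00 UTC"
force_tz(d, "UTC")  # the same clock time, reinterpreted in another zone
## [1] "2017-01-01 06:30:00 UTC"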
Vectors
- What is the difference between atomic vectors and lists?
- What is the difference between numeric vectors and integer vectors?
- List several problems that may occur with floating point arithmetic
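A quick sketch of the first two distinctions in the list above:
typeof(1L)            # integer vector (note the L suffix)
## [1] "integer"
typeof(1)             # numeric (double) vector
## [1] "double"
typeof(c(1, "a"))     # atomic vectors hold one type, so elements are coerced
## [1] "character"
typeof(list(1, "a"))  # lists can hold elements of different types
## [1] "list"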
Why does floating point math matter? For 99% of the work you are likely to do as an applied social scientist, you will not need to worry about it: someone else has written the numerical methods and (hopefully) accounted for numerical issues. However, if you are not even aware that "floating point numbers" are a "thing", then when something goes wrong it will seem like magic. Also, being aware of these problems will help you understand error messages from optimization routines that complain about "numerical precision".
Computerphile has a good video on Floating Point Numbers.
Be careful testing equality with floating point numbers:
0.1 + 0.2 == 0.3
## [1] FALSE
0.15 + 0.15 >= 0.1 + 0.2
## [1] FALSE
To test for equality with floating point numbers, use all.equal:
all.equal(0.15 + 0.15, 0.1 + 0.2)
## [1] TRUE
Functions that test equality of floating point numbers have some sort of "tolerance" argument, which is the maximum difference at which two numbers are still considered equal.
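For example, all.equal's tolerance argument, or dplyr::near, which compares with a built-in tolerance (a small sketch):
all.equal(1.0, 1.0001, tolerance = 1e-3)  # differences below the tolerance count as equal
## [1] TRUE
dplyr::near(0.1 + 0.2, 0.3)               # vectorized comparison with a default tolerance
## [1] TRUE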
Numbers too large to represent will "overflow" to Inf:
1e1000000
## [1] Inf
This is true even if the final expression would be finite. E.g. \(\log(e^{10000}) = 10000\), but in R,
log(exp(10000))
## [1] Inf
because \(e^{10000}\) overflows to Inf, and \(\log(\infty) = \infty\).
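The overflow threshold is recorded in .Machine; a quick sketch:
.Machine$double.xmax       # the largest representable finite double
## [1] 1.797693e+308
2 * .Machine$double.xmax   # anything beyond it overflows
## [1] Inf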
Models
Some of the discussion of models is slightly different, and has a different emphasis than in most social science research. This is largely because this book is speaking to data scientists, where the primary goal is prediction rather than theory testing (that I don’t view these as too different is a different story).
The discussion about hypothesis generation vs. confirmation is interesting. Too little emphasis is placed on hypothesis generation in social science. The importance of out-of-sample testing also receives too little emphasis in political science.
And from this discussion it should be clear that many papers in social science are hypothesis generation masquerading as hypothesis confirmation.
Model Basics
The distinction between a family of models and a fitted model is a useful way to think about models, especially since some families of models can themselves be viewed as fitted models of a more flexible family. For example, linear regression is a special case of GLMs, Gaussian processes, etc.
NOTE It’s worth mentioning these as more general models, though they don’t appear as much in social science work. I should try to explain that. I can think of several reasons:
- a preference for easy-to-explain models (though I think that’s wrong: most people can’t visualize high-dimensional space well, and they interpret results marginally even though the results are conditional)
- status-quo bias and path dependence, combined with a lack of knowledge of work outside the field and a median lack of technical ability to understand or use these models
- the most principled reason is that those more complicated models really excel at prediction. If we take an agnostic approach to regression, as in the Angrist and Pischke books, then regression is not being used to fit \(f(y | x)\); it is being used to fit \(E(y | x)\), and more specifically to get some sort of average effect for a change in a specific variable.
Model basics
For more complex models, visualize their
- predictions
- residuals
- see the tidyr::complete, tidyr::expand, and modelr::data_grid functions for generating grids of values at which to evaluate a model
- the modelr::add_residuals and modelr::add_predictions functions add residuals or predictions to the original data
- modelr::geom_ref_line adds a reference line to residual plots (a short worked sketch follows this list)
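A minimal sketch of that workflow, using the sim1 example data that ships with modelr:
library(modelr)
library(ggplot2)

mod <- lm(y ~ x, data = sim1)           # fit a simple linear model

grid <- data_grid(sim1, x)              # evenly spaced grid of x values
grid <- add_predictions(grid, mod)      # adds a pred column to the grid

sim1_resid <- add_residuals(sim1, mod)  # adds a resid column to the data

# Plot residuals against x, with a reference line at zero
ggplot(sim1_resid, aes(x, resid)) +
  geom_ref_line(h = 0) +
  geom_point()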