2 Proposal

In one to two paragraphs, propose a research project for this course. State your research question. Describe what data you will be using to address that question.

  1. Address any comments from the first proposal.
  2. What is your research question or puzzle?
  3. If you have a clear question: is it descriptive, predictive, or causal (or some combination of them)? What is the outcome variable of interest? What are the explanatory variables? If causal, what is your identification strategy (of the few methods you have seen thus far)?
  4. What data will you be using? What is the location of the data? Include links to the data. Will your data be coming from one or many sources? What format(s) are the data in? How do you intend to operationalize the concepts in your question with the data at hand?
  5. What do you expect to be the problems you will have to overcome in the analysis?

Try to answer these questions as best as possible. If you don’t have an answer yet, say so. We’ll discuss and resolve these issues in the comments. If you are having problems, describe them; that’s why item 5 is there. Research projects are an iterative and stochastic process. The purpose of these assignments is to focus you on certain parts, find the problems, and overcome them.

Exact instructions on directory organization will follow.

The purpose of this assignment is to encourage you to find, load, clean, and explore your data before continuing to analysis.

  1. State your research question, and your expected research design to answer that question.
  2. What are your data sources? Cite the data appropriately using name or title of the data set, authors, version, creation date, and persistent data identifier (e.g. DOI) if available or URL if one is not.
  3. Load, describe, and clean your data. Explain the steps you are taking where appropriate. Some questions to consider are:

    • Check for missing values. In some cases, particular numeric codes stand in for missing values. These are often extremely large or small values.
    • What are the types (string, numeric, integer, date, etc.) and anticipated range of values that the variables can take? Do they take those values in the data?
    • What is the unit of observation of your data? What will the unit of observation be for your analysis? Do you need to merge datasets? If so, what is their common identifier?

  4. If possible, produce some summary statistics and appropriate plots of the distributions of the variables you expect to use as the response and explanatory variables.
  5. What issues did you encounter in getting and cleaning the data? What issues remain?
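The checks in the list above can be sketched in a few lines. The example below is written in Python purely for illustration (the assignment itself is an .Rmd file, and the same steps carry over to R). Everything in it is an assumption made for the sketch: the inline data, the column names `id`, `age`, and `income`, the sentinel codes `-99` and `9999999`, and the plausible ranges. In a real project the sentinel codes should come from the data set's codebook.

```python
import csv
import io
import statistics

# Hypothetical extract; all values are made up for illustration.
RAW = """id,age,income
1,34,52000
2,-99,61000
3,45,9999999
4,29,
5,51,48000
"""

# Assumed sentinel codes per column (normally taken from the codebook).
SENTINELS = {"age": {"-99"}, "income": {"9999999"}}


def parse_number(text, sentinels):
    """Return a float, or None when the cell is blank or a sentinel code."""
    text = text.strip()
    if text == "" or text in sentinels:
        return None
    return float(text)


def load_rows(handle):
    """Read the CSV and recode blank cells and sentinel codes to None."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "id": int(record["id"]),
            "age": parse_number(record["age"], SENTINELS["age"]),
            "income": parse_number(record["income"], SENTINELS["income"]),
        })
    return rows


def describe(rows, column, low, high):
    """Count missing and out-of-range values, then summarize the rest."""
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "n_missing": len(rows) - len(values),
        "n_out_of_range": sum(1 for v in values if not low <= v <= high),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


rows = load_rows(io.StringIO(RAW))

# A merge key should identify each unit of observation exactly once.
ids_unique = len({row["id"] for row in rows}) == len(rows)

age_summary = describe(rows, "age", low=0, high=120)
income_summary = describe(rows, "income", low=0, high=1_000_000)

print("ids unique:", ids_unique)
print("age:", age_summary)
print("income:", income_summary)
```

Keeping the sentinel codes per column avoids silently dropping a legitimate value in one variable because it happens to match a missing-value code used in another.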

Write this assignment in a file named data-analysis-2.Rmd in your repository. It may be useful to copy material from the previous assignment as a starting point, since this should be viewed as a draft towards your final project.

To submit this assignment:

  1. Open a pull request titled “Review Data Analysis 2”
  2. Assign it to @jrnold and @CalvinGarner
  3. See the instructions on what to include in the comment.

2.1 Instructions

This assignment should be viewed as a step towards producing a final project. It is both a commitment device for you to work on the project throughout the course, and an opportunity for you to get feedback and suggestions from the instructors in a form that you can learn from and incorporate into your project.

The following questions are things that you should attempt to address in this assignment. However, research projects can progress at different paces due to a variety of factors. Turn in this assignment on time, with whatever you are able to do. Clearly and honestly describe any problems that you are having with your project. Problems are inevitable in a research project. Stating them here allows the instructors to work with you to address and overcome them.

  1. Rewrite your document so that it flows like a data analysis. The document should be prose describing a research question and the steps you are taking to answer that research question, along with code that actually implements those steps. Consult the examples of literate programming provided in Slack.

  2. In this draft, things you should focus on are:

    1. Clearly describing the data you are using. Do not simply load it and print it. Describe the relevant variables. Check for potential problems (e.g. missing data) and note any unresolved problems in the data relevant to your analysis. Provide any output, tables, and figures that are necessary for the reader to understand the data you will use to answer your research question.

    2. Proposing and clearly stating the method that you will use to answer your research question. Questions to address include:

      • How will you operationalize the relevant variables?

      • If a causal question: what is your identification strategy (randomization, regression discontinuity, before-and-after, diff-in-diff, or selection on observables)? In general, consider how you are generating the counterfactual. Implement that method.

      • If your identification strategy is selection on observables: What variables are possible confounders? Do you have observable measures for them? Think of at least one possible confounder that you don’t have a measure for.

      • If a predictive question: What is the predictive task? How are you assessing predictive performance?

      • If a descriptive question: What is being described? How does this improve on our previous understanding?

  3. Address all points raised in previous comments. This may or may not require making changes to your analysis and write-up. However, note and address all of these concerns in your comments.
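To make "how you are generating the counterfactual" concrete, here is a minimal sketch of one of the strategies listed above, a two-group, two-period diff-in-diff, written in Python purely for illustration. The observations, unit labels, and function names are all hypothetical. The estimate uses the control group's before-and-after change as the counterfactual change for the treated group, which is only credible under the assumption that the two groups would have followed parallel trends without treatment.

```python
from statistics import mean

# Hypothetical panel: (unit, in treated group?, post-treatment period?, outcome).
# All values are made up for illustration.
observations = [
    ("a", True, False, 10.0),
    ("a", True, True, 16.0),
    ("b", True, False, 12.0),
    ("b", True, True, 18.0),
    ("c", False, False, 9.0),
    ("c", False, True, 11.0),
    ("d", False, False, 11.0),
    ("d", False, True, 13.0),
]


def group_mean(data, treated, post):
    """Mean outcome for one group (treated or control) in one period."""
    return mean(y for _, t, p, y in data if t == treated and p == post)


def diff_in_diff(data):
    """Change in the treated group minus change in the control group."""
    treated_change = group_mean(data, True, True) - group_mean(data, True, False)
    control_change = group_mean(data, False, True) - group_mean(data, False, False)
    return treated_change - control_change


estimate = diff_in_diff(observations)
print("diff-in-diff estimate:", estimate)
```

A simple before-and-after comparison would use only `treated_change`; subtracting `control_change` is what removes changes over time that are common to both groups.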

Address theoretical, data, and methodological issues from the previous data analysis assignments.

At this stage, your data wrangling should be complete, and that section should be mostly written.

In this draft you should make some attempt at analysis that answers your research question. Consider the following.

  1. Is it a causal question? What is the identification strategy? What assumptions are being made, and are they plausible?
  2. Is it a predictive question? How are you measuring predictive performance? Are the out-of-sample tests appropriate?
  3. Is it a descriptive question? Why are your measures appropriate?
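For the predictive case, one common way to measure out-of-sample performance is to hold out part of the data, fit on the rest, and compare errors. The sketch below, in Python purely for illustration, does this for a one-variable least-squares line on simulated data. The data-generating process, the 80/20 split, the seed, and the choice of RMSE as the metric are all assumptions of the example, not requirements of the assignment.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def rmse(intercept, slope, xs, ys):
    """Root mean squared prediction error on the given observations."""
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Simulated data (hypothetical): y = 2 + 3x + standard normal noise.
rng = random.Random(42)
xs = [i / 10 for i in range(100)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Random 80/20 split into training and held-out test observations.
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = int(0.8 * len(indices))
train, test = indices[:cut], indices[cut:]

# Fit on the training observations only.
intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])

train_rmse = rmse(intercept, slope, [xs[i] for i in train], [ys[i] for i in train])
test_rmse = rmse(intercept, slope, [xs[i] for i in test], [ys[i] for i in test])

print("slope:", slope)
print("train RMSE:", train_rmse)
print("test RMSE:", test_rmse)
```

A random split is only appropriate when observations are exchangeable. With time series or grouped data, the held-out set should respect that structure, for example by holding out later periods or whole groups.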

If, due to data issues, you cannot run an analysis, clearly describe your plan. Do your best to run some analysis. We will use next week to work on your projects. Having something written, no matter how “imperfect”, provides a basis for discussion and improvement.

Details will be posted later.