Session 2: Data Management in R

Rutgers University Sociology R Bootcamp

Fred Traylor, Lab TA (he/him)

September 1, 2022

Good Afternoon!

This Afternoon’s Goals

  1. Working Directories & Projects

  2. Environments and Packages

  3. Packages

    1. Tidyverse

    2. usdata

  4. Data Management in R

    1. Filtering Rows

    2. Selecting Columns

Directories and Projects

Working Directories

Tomorrow, we’ll be working with data from outside of R. That is, we’ll be importing our own data files.

Before we can do that, though, we first need to understand where our data and files are currently being saved.

This is called a “Working Directory.” To find it, we can run getwd(), short for “get working directory.”

getwd()
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
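
To see why the working directory matters, here is a minimal sketch of how R resolves a file name against it. The file name "my_data.csv" is hypothetical and only used for illustration; the file does not need to exist for the code to run.

```r
# Where is R currently reading and writing files?
wd <- getwd()
print(wd)

# List whatever is already saved in the working directory
list.files()

# A bare file name is looked up inside the working directory.
# "my_data.csv" is a made-up name, not a file from this bootcamp.
full_path <- file.path(wd, "my_data.csv")
print(full_path)

# TRUE only if a file with that name is actually in the working directory
file.exists("my_data.csv")
```

If file.exists() returns FALSE for a file you know you saved, the file is probably in a different folder than the working directory.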

Projects

If you haven’t already, let’s go ahead and set up a project. We’ll save everything we do this semester into the project, so it’ll be a handy place to keep everything together. Projects serve two useful functions:

Let’s create a new project

  1. In the RStudio menu, click File > New Project

  2. In the popup menu, click “New Directory”

  3. Click “New Project”

  4. Give it an appropriate name and choose where it should go in your files

    • I named mine “ay23_lab_ta”

    • Other good options include “soc541”, “sociology_stats_1”, etc.

      • We’ll be making another project for 542 (Stats II) in the spring.
    • All of my R files are saved in “Documents > R”, so I made it a “subdirectory” within that folder.

You should now be looking at a new project.

Your working directory might have changed, too, and we can look at it with getwd().

getwd()
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
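
One way to check that you are inside a project is to look for the project file. This is a sketch that assumes RStudio's usual behavior of saving a file ending in ".Rproj" in the project folder; if your working directory is a subfolder of the project, the result may be empty.

```r
# Look for an RStudio project file in the current working directory.
# Assumption: RStudio saved a file ending in ".Rproj" in the project folder.
rproj_files <- list.files(getwd(), pattern = "\\.Rproj$")
print(rproj_files)

# The name of the folder we are currently working in
basename(getwd())
```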

Official Question Time 1

Since the last OQT, we’ve done:

  1. Viewing the Working Directory

  2. Creating a New Project

Environments and Packages

Other R Environments

Up to now, we’ve used what is called “Base R.” If you go to the “Environment” pane of your R window, there is a dropdown menu to see what is in each environment. Go ahead and click on it and you’ll see a selection of other “packages.”

No need to do anything inside these packages, but now you know how R knows what to do. If we look at the documentation for a function, the top of the help page gives us the function’s name followed, in curly brackets, by the name of the package that holds it.

Generally, we don’t want to mess with the code of any functions or values that come in our other packages because fixing them involves reinstalling a lot of things.
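
The same information is available from the console. The sketch below uses only functions that ship with R to list what is attached to the session and to ask which package a function comes from.

```r
# List the environments and packages currently attached to the session
search()

# Ask which package a function lives in
environmentName(environment(getwd))   # "base"
environmentName(environment(median))  # "stats"

# The package::function form calls a function from a specific package
stats::median(c(1, 3, 10))            # 3
```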

Introduction to Packages

The beauty of R is that, because it is open-source, anybody can add new functions and make them available to anybody. Because these functions often rely on each other, they get packaged together to make for easy (and consistent) usage.

These “packages” are what we’re going to be working with today. They will come in handy during this bootcamp and throughout the entire year as you take statistics.

I generally group packages into two main categories:

  1. Quality of life packages make R easier to use or simplify code
  2. Extension packages add capabilities that would otherwise be complicated in Base R

As you can guess, there’s a lot of overlap here. We’ll mostly be working with quality of life packages this fall, but we’ll also add some extension packages as well.

Preparing to Use Packages

When you want to use a package, there are two things you have to do:

  1. You have to install it.

    • There are thousands of packages available on CRAN (where you downloaded R from) and even more available elsewhere online. It would be too much for R to install every possible package on your computer when you first downloaded it, so you install them only as you need them.

    • Luckily, packages (like R) are free!

    • To install a package, you use the install.packages() function built into R.

  2. You have to call it from your library.

    • After it’s been downloaded, it’s saved on your computer. (YAY!)

    • But, R doesn’t know which ones you want to use yet, so you will need to call it into the working library via the library() function.

You can think of install.packages() as buying a tool and library() as actually getting it out of the toolbox.
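
The two steps can be sketched together as below. To keep the example runnable without downloading anything, it uses "tools", a package that ships with every R installation; in practice you would swap in "tidyverse" or "usdata". The variable name pkg is just for illustration.

```r
pkg <- "tools"  # ships with R; swap in "tidyverse" or "usdata" in practice

# Step 1 ("buying the tool"): install only if it is not already on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Step 2 ("getting it out of the toolbox"): load it for this session.
# character.only = TRUE is needed because the name is stored in a variable.
library(pkg, character.only = TRUE)

# Confirm that the package is now attached
paste0("package:", pkg) %in% search()
```

Installing is needed once per computer, while library() has to be run again in every new R session.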

Intro to the Tidyverse

This afternoon, and all throughout this year, we’re also going to use a popular package (well, set of packages) called the tidyverse.

Because it’ll take a minute or two to install, go ahead and type install.packages("tidyverse") into R and run it.

The tidyverse is “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

Basically, it makes it easier to manage, use, and look at our data. Today, we’ll be working with the manage and use parts of this. In a few weeks, we’ll do a little bit with how the tidyverse makes data visualization better and easier than with Base R.

Hopefully, by now, the tidyverse has been installed, so let’s go ahead and call it into the library with library(tidyverse).

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

You’ll see messages like these every time you load the tidyverse. They tell you which packages have been loaded, whether any package was built under a newer version of R than the one you have installed, and whether there are any “conflicting” functions. Generally you don’t need to worry about these. (We’ll deal with them more in the Spring…)
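
To make the “conflicts” part concrete, here is a small sketch. It assumes dplyr (part of the tidyverse) is installed, and the data frame toy is made up purely for illustration.

```r
library(dplyr)  # part of the tidyverse

# A tiny made-up data frame, just for illustration
toy <- data.frame(x = c(1, 5, 10))

# Once dplyr is loaded, a plain filter() call uses dplyr's version
filter(toy, x > 3)

# The package::function form removes any ambiguity
kept <- dplyr::filter(toy, x > 3)
print(kept)

# The masked version from the stats package is still reachable the same way
stats::filter(1:5, rep(1/3, 3))
```

“Masking” only changes which function a bare name refers to; neither function is deleted.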

The County Dataset

This afternoon, we’re going to work with a dataset of US counties. (In case you don’t know, counties are smaller than states but (generally) larger than cities.)

We’re going to use a package called usdata and a dataset saved in it called “county.” To download it, go ahead and type install.packages("usdata") into R and run it.

Once that’s installed, run library(usdata) to bring it into our environment.

library(usdata)
## Warning: package 'usdata' was built under R version 4.1.2

Let’s take a look at the dataframe to see what we’re working with.

View(county)

dim(county)
## [1] 3142   15
str(county)
## tibble [3,142 x 15] (S3: tbl_df/tbl/data.frame)
##  $ name             : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ state            : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pop2000          : num [1:3142] 43671 140415 29038 20826 51024 ...
##  $ pop2010          : num [1:3142] 54571 182265 27457 22915 57322 ...
##  $ pop2017          : int [1:3142] 55504 212628 25270 22668 58013 10309 19825 114728 33713 25857 ...
##  $ pop_change       : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
##  $ poverty          : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
##  $ homeownership    : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
##  $ multi_unit       : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
##  $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
##  $ metro            : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 2 1 1 ...
##  $ median_edu       : Factor w/ 4 levels "below_hs","hs_diploma",..: 3 3 2 2 2 2 2 3 2 2 ...
##  $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
##  $ median_hh_income : int [1:3142] 55317 52562 33368 43404 47412 29655 36326 43686 37342 40041 ...
##  $ smoking_ban      : Factor w/ 3 levels "none","partial",..: 1 1 2 1 1 1 NA NA 1 1 ...

We can see that we have a lot of observations (3,142) and 15 variables.

Fortunately, the names are fairly descriptive, but we should still look at the documentation (?county) to see what everything means.
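
Beyond dim() and str(), a few other base R functions give a quick look at the data. This sketch assumes usdata is installed and uses two of the columns shown in the str() output above.

```r
library(usdata)

# Column names only
names(county)

# First three rows
head(county, 3)

# Quick numeric summary of one column
summary(county$poverty)

# Counts for a categorical column, including missing values
table(county$smoking_ban, useNA = "ifany")
```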

The str() output also tells us that the data is saved as a “tibble.”

In the Packages pane, scroll down to the one that says “tibble.” If it isn’t checked, go ahead and check it to bring it to the library.

The description for “tibble” is “Simple Data Frames.” Indeed, tibbles are the same as data frames, but with a few nice features that make it easier to work with large datasets:

  1. It gives us the data type for each column

  2. When you print to the console, it shortens the output

    • Only ten observations

    • Only as many columns as will fit

    • And gives a summary of what was cut off

  3. Smaller storage
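
The printing difference is easiest to see side by side. This is a small sketch that assumes the tibble package is installed; the objects df and tb are made up for illustration.

```r
library(tibble)

# A made-up data frame with 50 rows
df <- data.frame(id = 1:50, value = (1:50) / 2)

# The same data as a tibble
tb <- as_tibble(df)

class(df)  # "data.frame"
class(tb)  # "tbl_df" "tbl" "data.frame"

# Printing the tibble shows the first ten rows plus each column's type
tb

# Printing the data frame shows all 50 rows
df

# A tibble is still a data frame, so data frame functions work on it
is.data.frame(tb)  # TRUE
```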

Viewing Our New Packages