Fred Traylor, Lab TA (he/him)
September 1, 2022
Working Directories & Projects
Environments and Packages
Packages
Tidyverse
usdata
Data Management in R
Filtering Rows
Selecting Columns
Tomorrow, we’ll be working with data from outside of R. That is, we’ll be importing our own data files.
Before we can do that, though, we first need to understand where our data and files are currently being saved.
This is called a “Working Directory.” To find it, we can run getwd(), short for “get working directory.”
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
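As a minimal sketch of how the working directory gets used, the snippet below stores the result of getwd() and builds a file path relative to it. The file name my_data.csv is hypothetical and chosen only for illustration.

```r
# Ask R where it is currently reading and writing files.
wd <- getwd()
print(wd)

# A relative file name is resolved against the working directory.
# "my_data.csv" is a hypothetical file name used only for illustration.
full_path <- file.path(wd, "my_data.csv")
print(full_path)

# Check whether such a file actually exists in that folder.
print(file.exists(full_path))
```

Your own path will differ from the one printed in these notes, since it depends on where your files live.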
If you haven’t already, let’s go ahead and set up a project. We’ll save everything we do this semester into the project, so it’ll be a handy place to keep everything together. Projects do two useful things:
They keep everything together
They let us tell R to use the project as a working directory
In the RStudio menu, click File > New Project
In the popup menu, click “New Directory”
Click “New Project”
Give it an appropriate name and choose where it should go in your files
I named mine “ay23_lab_ta”
Other good options include “soc541”, “sociology_stats_1”, etc.
All of my R files are saved in “Documents > R”, so I made it a “subdirectory” within that folder.
You should now be looking at a new project.
Your working directory might have changed, too, and we can look at it with getwd().
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
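One hedged way to confirm that you are working inside a project: RStudio marks a project folder with a file ending in .Rproj. The sketch below looks for such a file in the working directory and in the folder one level up, since the working directory can be a subfolder of the project (as in the output above, assuming the project folder there is ay23_lab_ta).

```r
# RStudio projects are marked by a file ending in ".Rproj".
# Look in the working directory itself...
here <- list.files(getwd(), pattern = "\\.Rproj$")

# ...and in its parent folder, in case we are in a subfolder of the project.
one_up <- list.files(dirname(getwd()), pattern = "\\.Rproj$")

print(here)
print(one_up)
```

If both results are empty, you are probably not inside a project folder yet.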
Since the last OQT, we’ve done:
Viewing the Working Directory
Creating a New Project
Up to now, we’ve used what is called “Base R.” If you go to the “Environment” pane of your R window, there is a dropdown menu to see what is in each environment. Go ahead and click on it and you’ll see a selection of other “packages.”
No need to do anything inside these packages, but now you know how R knows what to do. If we look at the documentation for a function, it gives us the function’s name followed by the name of the package holding it in curly brackets.
For example, ?table gives us table {base}, telling us that the function table comes in the base package built into R.
If we do it for head() we see that it comes in the utilities (utils) R package.
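The same information can also be pulled out with code rather than the help pages. The sketch below is one way to do it using only functions that ship with R: search() lists the attached packages (roughly what the Environment dropdown shows), and environmentName(environment(f)) reports which package a function f lives in.

```r
# List everything currently attached, including the loaded packages.
print(search())

# Ask which package each function comes from.
table_home <- environmentName(environment(table))
head_home <- environmentName(environment(head))

print(table_home)  # "base"
print(head_home)   # "utils"
```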
Generally, we don’t want to mess with the code of any functions or values that come in our other packages because fixing them involves reinstalling a lot of things.
The beauty of R is that, because it is open-source, anybody can add new functions and make them available to anybody. Because these functions often rely on each other, they get packaged together to make for easy (and consistent) usage.
These “packages” are what we’re going to be working with today. They will come in handy during this bootcamp and throughout the entire year as you take statistics.
I generally group packages into two main categories: quality of life packages and extension packages.
As you can guess, there’s a lot of overlap here. We’ll mostly be working with quality of life packages this fall, but we’ll also add some extension packages as well.
When you want to use a package, there are two things you have to do:
You have to install it.
There are thousands of packages available on CRAN (where you downloaded R from) and even more available elsewhere online. It’d be too much for R to install every possible package on your computer when you first downloaded it, so packages are only installed when you ask for them.
Luckily, packages (like R) are free!
To install a package, you use the install.packages() function built into R.
You have to call it from your library.
After it’s been downloaded, it’s saved on your computer. (YAY!)
But R doesn’t know which packages you want to use yet, so you will need to call each one into the working library via the library() function.
You can think of install.packages() as buying a tool and library() as actually getting it out of the toolbox.
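The two steps can be sketched in code. The helper below, load_if_installed(), is hypothetical (it is not part of R or of these notes); it uses requireNamespace() to check whether a package is already installed and only then attaches it with library(). It deliberately does not install anything for you, since installing is a one-time step you run yourself.

```r
# Hypothetical helper: attach a package only if it is already installed.
load_if_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # The "buying the tool" step has not happened yet.
    message("Package '", pkg, "' is not installed; run install.packages(\"",
            pkg, "\") once.")
    return(invisible(FALSE))
  }
  # The "getting it out of the toolbox" step.
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so it is always installed.
ok <- load_if_installed("stats")
print(ok)
```

You install a package once per computer, but you call library() in every new R session where you want to use it.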
This afternoon, and all throughout this year, we’re also going to use a popular package (well, set of packages) called the tidyverse. Because it’ll take a minute or two to install, go ahead and type install.packages("tidyverse") into R and run it.
The tidyverse is “an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
Basically, it makes it easier to manage, use, and look at our data. Today, we’ll be working with the manage and use parts of this. In a few weeks, we’ll do a little bit with how the tidyverse makes data visualization better and easier than with Base R.
Hopefully, by now, the tidyverse has been installed, so let’s go ahead and call it into the library with library(tidyverse).
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
You’ll get these messages and warnings every time you load the tidyverse. They tell you which packages have been loaded, whether any package was built under a newer version of R than the one you have installed, and whether there are any “conflicting” functions (functions in two packages that share a name). Generally you don’t need to worry about these. (We’ll deal with them more in the Spring…)
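To make the “conflicts” lines concrete: two packages each define a function called filter(), and whichever was loaded last wins when you type the bare name. A minimal sketch of how to be explicit is to prefix the function with its package and two colons. The vector x and data frame df here are made-up examples, and the dplyr part only runs if dplyr is installed.

```r
# A small made-up vector for illustration.
x <- c(2, 4, 6, 8, 10)

# stats::filter() is a time-series filter; here, a 3-point moving average.
moving_avg <- stats::filter(x, rep(1 / 3, 3))
print(moving_avg)

# dplyr::filter() keeps rows of a data frame that meet a condition.
if (requireNamespace("dplyr", quietly = TRUE)) {
  df <- data.frame(x = x)
  big_rows <- dplyr::filter(df, x > 5)
  print(big_rows)
}
```

Writing the package name out like this works whether or not the package has been attached with library().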
This afternoon, we’re going to work with a dataset of US counties. (In case you don’t know, counties are smaller than states but (generally) larger than cities.)
We’re going to use a package called usdata and a dataset saved in it called “county.” To download it, go ahead and type install.packages("usdata") into R and run it. Once that’s installed, run library(usdata) to bring it into our environment.
## Warning: package 'usdata' was built under R version 4.1.2
Let’s take a look at the dataframe to see what we’re working with.
## [1] 3142 15
## tibble [3,142 x 15] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:3142] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pop2000 : num [1:3142] 43671 140415 29038 20826 51024 ...
## $ pop2010 : num [1:3142] 54571 182265 27457 22915 57322 ...
## $ pop2017 : int [1:3142] 55504 212628 25270 22668 58013 10309 19825 114728 33713 25857 ...
## $ pop_change : num [1:3142] 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
## $ poverty : num [1:3142] 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
## $ homeownership : num [1:3142] 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ multi_unit : num [1:3142] 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ unemployment_rate: num [1:3142] 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
## $ metro : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 2 1 1 ...
## $ median_edu : Factor w/ 4 levels "below_hs","hs_diploma",..: 3 3 2 2 2 2 2 3 2 2 ...
## $ per_capita_income: num [1:3142] 27842 27780 17892 20572 21367 ...
## $ median_hh_income : int [1:3142] 55317 52562 33368 43404 47412 29655 36326 43686 37342 40041 ...
## $ smoking_ban : Factor w/ 3 levels "none","partial",..: 1 1 2 1 1 1 NA NA 1 1 ...
We can see that we have a lot of observations (3,142) and 15 variables.
Fortunately, the names are fairly descriptive, but we should still look at the documentation (?county) to see what everything means.
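The calls that produced the output above are not shown in these notes. As an assumption, output of that shape is what dim() and str() return, so a sketch of the inspection step might look like the following. It only runs if usdata is installed.

```r
# Assumption: the output shown above came from calls like dim() and str().
if (requireNamespace("usdata", quietly = TRUE)) {
  county <- usdata::county

  # Number of rows (observations) and columns (variables).
  print(dim(county))

  # Structure: each column's name, type, and first few values.
  str(county)

  # The first six rows, for a quick look at actual values.
  print(head(county))
}
```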
It also tells us that the data is saved as a “tibble.”
In the Packages pane, scroll down to the one that says “tibble.” If it isn’t checked, go ahead and check it to bring it to the library.
Checking the box is equivalent to running library(tibble), and you’ll actually see this code run in the console when you check it.
The description for “tibble” is “Simple Data Frames.” Indeed, tibbles are the same as data frames, but with a few nice features that make them easier to work with when datasets are large:
It gives us the data type for each column
When you print to the console, it shortens the output
Only ten observations
Only as many columns as will fit
And gives a summary of what was cut off
Smaller storage
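To see the printing differences for yourself, the sketch below converts a built-in data frame (mtcars, which ships with R) into a tibble with tibble::as_tibble() and prints it. It is an illustration on a stand-in dataset rather than the county data, and it only runs if the tibble package is installed.

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  # Start from a plain data frame that ships with R.
  df <- head(mtcars, 20)

  # Convert it to a tibble.
  tb <- tibble::as_tibble(df)

  # A tibble is still a data frame, with extra classes layered on top.
  print(class(df))
  print(class(tb))
  print(is.data.frame(tb))

  # Printing shows column types and only the first ten rows.
  print(tb)
}
```

Because a tibble is still a data frame underneath, functions that expect a data frame generally accept a tibble as well.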