Session 3: Advanced Data Management in R

Rutgers University Sociology R Bootcamp

Fred Traylor, Lab TA (he/him)

September 2, 2022

Good Morning and Welcome Back!

This Morning’s Goals

  1. Missing Data

  2. Importing Data

  3. Cleaning Data

    • Editing Variable Values

    • Creating New Variables

Missing Data

How Does Missing Data Come Up?

Let’s create some on purpose so we can look at it.

missvec <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)

We can use the is.na() function to see which values in our vector are missing.

is.na(missvec)
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
sum(is.na(missvec))
## [1] 2
table(is.na(missvec))
## 
## FALSE  TRUE 
##     7     2

When we have missing data, element-wise operations on a vector are still calculated for every non-missing value; the missing values simply stay NA:

missvec * 3
## [1] 135 267  21  NA 294  NA   9 135  21
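Summary functions such as mean() and sum() are stricter than the element-wise arithmetic above: a single NA makes the whole result NA unless we tell R to skip the missings. Here is a minimal sketch using the na.rm argument (the vector is re-created so the snippet runs on its own; mean_no_na and sum_no_na are names made up for this example):

```r
# Re-create the example vector so this snippet is self-contained
missvec <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)

# Summaries return NA as soon as any value is missing
mean(missvec)   # NA
sum(missvec)    # NA

# na.rm = TRUE drops the missings before calculating
mean_no_na <- mean(missvec, na.rm = TRUE)
sum_no_na  <- sum(missvec, na.rm = TRUE)
mean_no_na      # 42
sum_no_na       # 294
```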


When we use the table() function on data, we can’t see the missing values unless we include the useNA = "ifany" argument, like this:

table(missvec)
## missvec
##  3  7 45 89 98 
##  1  2  2  1  1
table(missvec, useNA = "ifany")
## missvec
##    3    7   45   89   98 <NA> 
##    1    2    2    1    1    2

Now we can see the two missing values.

We could also set it to useNA = "always", which includes a spot for missings in the table, even if there aren’t any.

nomiss <- c("karen", "quan", "tom", "kristen", "paul", "karen", "paul", "karen")
table(nomiss)
## nomiss
##   karen kristen    paul    quan     tom 
##       3       1       2       1       1
table(nomiss, useNA = "ifany")
## nomiss
##   karen kristen    paul    quan     tom 
##       3       1       2       1       1
table(nomiss, useNA = "always")
## nomiss
##   karen kristen    paul    quan     tom    <NA> 
##       3       1       2       1       1       0

Types of Missing Data

Remember how we have different types of data:


There are also different types of missing values. Right now, we’re only going to use four:

  • NA

  • NA_character_

  • NA_real_

  • NA_integer_

Note that the last three types of missings have an underscore (`_`) at the end of them.

Other types of missings might come up over the next year, but we see them so rarely that we won’t use them now.

When in doubt, run typeof() to see how it is stored.

a <- 1       # A number without L becomes double
b <- 3L      # Integers are created by putting a capital L after the number
c <- "word"
d <- TRUE
typeof(a)
## [1] "double"
typeof(b)
## [1] "integer"
typeof(c)
## [1] "character"
typeof(d)
## [1] "logical"
typeof(missvec)
## [1] "double"


You might remember from yesterday that we used a function class(). These are very similar, but typeof() will tell us if it is integer or double, making it more helpful in determining if we need to use NA or NA_integer_.

a <- 1
b <- 3L      # Integers are created by putting a capital L after the number
typeof(a)
## [1] "double"
class(a)
## [1] "numeric"
typeof(b)
## [1] "integer"
class(b)
## [1] "integer"
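To connect typeof() with the missing-value types, here is a small sketch of how each NA variant is stored. It also shows that a plain NA placed inside c() is converted to match its neighbours, which is why plain NA is usually fine when building a vector by hand.

```r
# How each kind of missing value is stored
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# Inside c(), a plain NA takes on the type of the other values
typeof(c(1L, NA))      # "integer"
typeof(c(1.5, NA))     # "double"
typeof(c("a", NA))     # "character"
```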

Practicing with Missing Data

Let’s create some missing data:

misstext <- c("Blue","Green", "Orange", NA_character_, "Red", "Blue", "Orange", NA, "NA")
misstext
## [1] "Blue"   "Green"  "Orange" NA       "Red"    "Blue"   "Orange" NA      
## [9] "NA"
table(misstext, useNA = "ifany")
## misstext
##   Blue  Green     NA Orange    Red   <NA> 
##      2      1      1      2      1      2

Note here that it didn’t matter whether we used NA or NA_character_ when we created the vector, as both register as <NA> in the table.

One value labeled "NA" was kept as text, though. Because we used quotation marks around it when we created the vector, R thought we wanted it as text and left it undisturbed.
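If that text "NA" was really meant to be missing, one way to handle it is sketched below in base R. The vector is re-created so the snippet runs on its own, and misstext_clean is a name made up for this example so the original vector stays untouched.

```r
misstext <- c("Blue", "Green", "Orange", NA_character_, "Red", "Blue", "Orange", NA, "NA")

is.na(misstext)       # TRUE only for the two real missings (positions 4 and 8)
misstext %in% "NA"    # TRUE only for the text "NA" (position 9)

# Turn the text "NA" into a real missing value
misstext_clean <- replace(misstext, misstext %in% "NA", NA_character_)

sum(is.na(misstext))        # 2
sum(is.na(misstext_clean))  # 3
```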

Now, let’s combine our two vectors into a tibble.

library(tidyverse)         # Remember to call this in or the below code won't work 
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
misstibble <- tibble(missvec,
                     misstext) %>% 
  rename(numbers = missvec,
         colors = misstext)
misstibble
## # A tibble: 9 x 2
##   numbers colors
##     <dbl> <chr> 
## 1      45 Blue  
## 2      89 Green 
## 3       7 Orange
## 4      NA <NA>  
## 5      98 Red   
## 6      NA Blue  
## 7       3 Orange
## 8      45 <NA>  
## 9       7 NA

We can see that some rows aren’t missing anything, while others are missing one or even two values.

Removing Data with Missing Values

If we want to keep only the rows that have no missing data, we can use the drop_na() function.

drop_na(misstibble)
## # A tibble: 6 x 2
##   numbers colors
##     <dbl> <chr> 
## 1      45 Blue  
## 2      89 Green 
## 3       7 Orange
## 4      98 Red   
## 5       3 Orange
## 6       7 NA
misstibble %>% drop_na()
## # A tibble: 6 x 2
##   numbers colors
##     <dbl> <chr> 
## 1      45 Blue  
## 2      89 Green 
## 3       7 Orange
## 4      98 Red   
## 5       3 Orange
## 6       7 NA

Removing Missing Data From Specific Columns

drop_na() also has the ability to remove observations that are missing on specific columns.

misstibble %>% drop_na()
## # A tibble: 6 x 2
##   numbers colors
##     <dbl> <chr> 
## 1      45 Blue  
## 2      89 Green 
## 3       7 Orange
## 4      98 Red   
## 5       3 Orange
## 6       7 NA
misstibble %>% drop_na(numbers)  # Drop observations that are missing on `numbers`
## # A tibble: 7 x 2
##   numbers colors
##     <dbl> <chr> 
## 1      45 Blue  
## 2      89 Green 
## 3       7 Orange
## 4      98 Red   
## 5       3 Orange
## 6      45 <NA>  
## 7       7 NA
misstibble %>% drop_na(colors)   # Drop observations that are missing on `colors`
## # A tibble: 7 x 2
##   numbers colors
##     <dbl> <chr> 
## 1      45 Blue  
## 2      89 Green 
## 3       7 Orange
## 4      98 Red   
## 5      NA Blue  
## 6       3 Orange
## 7       7 NA

This is useful where there is a lot of missingness throughout the dataset and we don’t want to remove observations that are missing on variables we don’t care about.
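For comparison, the same row filtering can be sketched in base R with complete.cases() and is.na(). The data frame below re-creates misstibble as a plain data.frame (called missdf here, a name made up for this example) so the snippet runs without the tidyverse.

```r
missdf <- data.frame(
  numbers = c(45, 89, 7, NA, 98, NA, 3, 45, 7),
  colors  = c("Blue", "Green", "Orange", NA, "Red", "Blue", "Orange", NA, "NA")
)

# Like drop_na(): keep rows with no missing values in any column
all_complete <- missdf[complete.cases(missdf), ]

# Like drop_na(numbers): keep rows where `numbers` is not missing
numbers_complete <- missdf[!is.na(missdf$numbers), ]

nrow(all_complete)      # 6
nrow(numbers_complete)  # 7
```

drop_na() is usually easier to read inside a pipe, but it helps to know that nothing more mysterious than this is going on underneath.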

Official Question Time 1

Since we started today, we’ve done:

  1. Missing data

    1. How to create it

    2. Types of it

      • NA

      • NA_character_

      • NA_real_

      • NA_integer_

    3. How to detect it

      • is.na()
    4. How to remove it

      • drop_na()

Clearing the Environment

We’re about to change gears a little and won’t need these objects we just saved. Let’s remove them from the global environment so they won’t clutter up our workspace.

ls()
## [1] "a"          "b"          "c"          "d"          "misstext"  
## [6] "misstibble" "missvec"    "nomiss"
rm(list=ls())
ls()
## character(0)

Starting with the GSS

Let’s Clean Up the GSS

This morning, I sent you a file containing data from the General Social Survey (GSS).

Go ahead and save that into your working directory, then click on it to import it into your Global Environment.

getwd()
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
load("~/R/ay23_lab_ta/bootcamp/rawgss.RData")

Using our imported GSS data, and what we’ve learned so far on data management, let’s create some variables.

Creating a New Dataset

Looking at the size of the dataset (dim(GSS) = 2348, 89), we can see that there are way more variables than we want right now. Let’s create a new dataset “gss_bootcamp”, and select only a few of the variables we want.

  1. Demographic factors

    1. ID_ (Respondent’s ID)

    2. AGE (age in years)

    3. EDUC (years of education)

  2. Political Variables

    1. PARTYID (Party Identification)

    2. POLVIEWS (Political Ideology)

  3. Economic Variables

    1. PRESTG10 (occupational prestige score)

    2. INCOME (annual income in categories)

    3. TVHOURS (hours of TV watched last week)

  4. Quality of Life

    1. HAPPY (General Happiness)

    2. HEALTH (General Health)

  5. We also want a set of variables asking what they think about government spending on a series of factors. These all start with “NAT” and end with a descriptor of the topic.

Go ahead and try this yourself, then go to the next slide to see how I did it.

Smaller Dataset

library(tidyverse)
gss_bootcamp <- GSS %>% 
  as_tibble() %>% 
  select(ID_, AGE, EDUC,            # Demographics
         PARTYID, POLVIEWS,         # Political Variables
         PRESTG10, INCOME, TVHOURS,  # Economic Variables
         HAPPY, HEALTH,             # Quality of Life
         starts_with("NAT")         # National Spending 
         ) 

Now, the dataset is much smaller with only 28 variables.
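If you don’t have the GSS file handy, the same select() pattern can be tried on a tiny made-up dataset. This is only a sketch: the column names imitate the GSS, but the values are invented and toy and toy_small are names made up for this example.

```r
library(dplyr)

# A made-up miniature dataset; values are invented for illustration
toy <- tibble(
  ID_     = 1:3,
  AGE     = c(43L, 74L, 42L),
  NATFARE = c(1L, 2L, 3L),
  NATROAD = c(2L, 2L, 1L),
  HAPPY   = c(2L, 1L, 1L)
)

toy_small <- toy %>% 
  select(ID_, starts_with("NAT"))   # Keeps ID_ plus every column whose name begins with "NAT"

names(toy_small)   # "ID_" "NATFARE" "NATROAD"
```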

The Beauty of Source Code

Actually, let’s add sex and race to the dataset too.

If we didn’t do this as source, we’d have to entirely retype the previous section of code.

But instead, we can simply go back and add “SEX” and “RACE” to our code and rerun the section.

gss_bootcamp <- GSS %>% 
  as_tibble() %>% 
  select(ID_, AGE, EDUC, SEX, RACE,      # Demographics
         PARTYID, POLVIEWS,              # Political Variables
         PRESTG10, INCOME, TVHOURS,      # Economic Variables
         HAPPY, HEALTH,                  # Quality of Life
         starts_with("NAT")              # National Spending 
         ) 

This is why it’s also important not to overwrite the original file (“GSS”).

And now, let’s take a look at our dataset with the View(gss_bootcamp) function to see it in our source pane, or print(gss_bootcamp) to see it in the console. Go ahead and pick one (or both).

print(gss_bootcamp)
## # A tibble: 2,348 x 30
##      ID_   AGE  EDUC   SEX  RACE PARTYID POLVIEWS PRESTG10 INCOME TVHOURS HAPPY
##    <int> <int> <int> <int> <int>   <int>    <int>    <int>  <int>   <int> <int>
##  1     1    43    14     1     1       5        6       47     13       3     2
##  2     2    74    10     2     1       2        8       22     12      -1     1
##  3     3    42    16     1     1       4        5       61     12       1     1
##  4     4    63    16     2     1       2        4       59     13       1     1
##  5     5    71    18     1     2       6        7       53     13      -1     2
##  6     6    67    16     2     1       2        3       53     98      10     3
##  7     7    59    13     2     2       0        4       48     10      -1     2
##  8     8    43    12     1     1       5        5       35     12      -1     2
##  9     9    62     8     2     1       3        4       35      5       4     3
## 10    10    55    12     1     1       1        8       39     12       2     2
## # ... with 2,338 more rows, and 19 more variables: HEALTH <int>, NATFARE <int>,
## #   NATROAD <int>, NATSOC <int>, NATMASS <int>, NATPARK <int>, NATCHLD <int>,
## #   NATSCI <int>, NATENRGY <int>, NATAID <int>, NATARMS <int>, NATSPAC <int>,
## #   NATENVIR <int>, NATHEAL <int>, NATCITY <int>, NATCRIME <int>,
## #   NATDRUG <int>, NATEDUC <int>, NATRACE <int>

We see that all we have are numbers. Next up: how to make these numbers cleaner for us to work with.

Official Question Time 2

Since the last OQT, we’ve done:

  1. Creating a new dataset from our old one

  2. Selecting variables of interest

Data Management With Our New Dataset

Mutating Variables

Let’s start by adding a variable onto our dataset.

The tidyverse package dplyr gives us a great tool for mutating our variables. It’s called, aptly enough, mutate().

It takes the form mutate(variable=value).

Let’s say we want to add a column to our gss_bootcamp where every value is 123456. We can do that with:

gss_bootcamp <- gss_bootcamp %>% 
  mutate(numbers = 123456) 
gss_bootcamp %>% select(numbers)
## # A tibble: 2,348 x 1
##    numbers
##      <dbl>
##  1  123456
##  2  123456
##  3  123456
##  4  123456
##  5  123456
##  6  123456
##  7  123456
##  8  123456
##  9  123456
## 10  123456
## # ... with 2,338 more rows

We told it that the column “numbers” should contain 123456. Because the column didn’t exist, it added it onto our existing dataset.

We can also use mutate to reference columns within the same dataset, or even the column itself. For example:

gss_bootcamp %>% 
  mutate(newnumbers = numbers,
         numbers = numbers - 100000,
         bignumbers = numbers * 2) %>% 
  select(numbers, newnumbers, bignumbers)
## # A tibble: 2,348 x 3
##    numbers newnumbers bignumbers
##      <dbl>      <dbl>      <dbl>
##  1   23456     123456      46912
##  2   23456     123456      46912
##  3   23456     123456      46912
##  4   23456     123456      46912
##  5   23456     123456      46912
##  6   23456     123456      46912
##  7   23456     123456      46912
##  8   23456     123456      46912
##  9   23456     123456      46912
## 10   23456     123456      46912
## # ... with 2,338 more rows

We gave it our dataset, piped it down, and told R to mutate three columns:

  1. newnumbers, a new column, should take the value of numbers

  2. numbers, an existing column, should take the value numbers - 100000,

    • In practice it’s better to add new columns than to overwrite existing ones, but we didn’t save this so we’re not at risk of overwriting.
  3. bignumbers, a new column, should take the value numbers * 2.

    • Also, note that it took the new value of numbers, not the original value, since it came after we changed the value.

Recoding with ifelse()

One of the most basic computing functions is ifelse(). It takes the form ifelse(test, yes, no). In other words, “if the data passes the test with TRUE, follow the yes condition; else, follow the no condition.”

To illustrate, let’s make a small vector:

smallvec <- c(1,2,3,4,5,6,7)

ifelse(smallvec<4, 1, 0)
## [1] 1 1 1 0 0 0 0

Here’s what it did:

  1. Take each value in smallvec

  2. Test if the value is less than 4

  3. If TRUE, return 1

  4. If FALSE, return 0

We can also do this and make it return whatever we want. For example

ifelse(smallvec<4, "Small","Big" )
## [1] "Small" "Small" "Small" "Big"   "Big"   "Big"   "Big"
ifelse(smallvec<4, 234/5, 847^2)
## [1]     46.8     46.8     46.8 717409.0 717409.0 717409.0 717409.0
ifelse(smallvec<4, smallvec, smallvec^3)
## [1]   1   2   3  64 125 216 343


So now let’s recode our variable. In the GSS, SEX is coded 1 = male, 2 = female, but we only have the numbers in our dataset. We can use our ifelse() function to give these values text names. Let’s create a new variable called sex_text that has text values for the variable sex.

gss_bootcamp <- gss_bootcamp %>% 
  mutate(sex_text = ifelse(SEX == 1, "Male", "Female"))

After recoding like this, I like to go back and make a table() to make sure my code did what I wanted.

table(gss_bootcamp$sex_text, gss_bootcamp$SEX, useNA = "ifany")
##         
##             1    2
##   Female    0 1296
##   Male   1052    0
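The table check matters because ifelse() sends every value that fails the test to the “no” branch, while a missing test result stays missing. Here is a small sketch with a made-up vector (the 9 stands in for a hypothetical “no answer” code, which SEX does not actually have in our data):

```r
sex_codes <- c(1, 2, 9, NA)   # Made-up values for illustration only

recoded <- ifelse(sex_codes == 1, "Male", "Female")
recoded
# "Male" "Female" "Female" NA
```

Notice that the 9 was quietly labeled “Female”. With a variable that has more than two codes, that is exactly the kind of slip a table() would reveal.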

Recoding with case_when()

We can also do this more explicitly using case_when(), a function from the tidyverse. case_when() lets us do variable management with logical tests in a way that is easy to follow and understand.

It takes the form

dataset %>%                         # Dataset
  mutate(variable = case_when(      # Mutate a variable depending on a case
    condition ~ value,
    condition ~ value
  ))

What does this mean?

  1. We start with our dataset, as always.

  2. Then, we tell R to mutate and give it our variable.

    • Both these steps are same as before
  3. Then we say, “actually, instead of applying the same value (or value calculation) for everybody, R should assign the value depending on a condition”

    • The tilde used to assign is in the top left corner of your keyboard, next to the 1
    • Commas go after all condition lines except the last
      • Common Error #2: Forgetting to separate with a comma
  4. Lastly, we close our parentheses

So we can code sex again in a different way, this time using case_when().

gss_bootcamp <- gss_bootcamp %>% 
  mutate(sex_text_cw = case_when(
    SEX == 1 ~ "Male",          # If SEX == 1, then assign the value "Male"
    SEX == 2 ~ "Female",        # If SEX == 2, then assign the value "Female"
  )) 

See what we did there?

  1. Start with our data and pipe it down

  2. mutate our data so that sex_text_cw takes a value dependent on the following conditions

    • If SEX == 1, then assign the value “Male”

    • If SEX == 2, then assign the value “Female”

Lastly, let’s create two tables showing that the coding worked exactly as we wanted and there are no missing values.

table(gss_bootcamp$sex_text_cw, gss_bootcamp$SEX, useNA = "ifany") # Checking our work with the original variable
##         
##             1    2
##   Female    0 1296
##   Male   1052    0
table(gss_bootcamp$sex_text_cw, gss_bootcamp$sex_text, useNA = "ifany") # Checking our work with the ifelse variable from earlier 
##         
##          Female Male
##   Female   1296    0
##   Male        0 1052
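One reason the explicit version can be safer: a value that matches none of the conditions comes back from case_when() as NA instead of falling into a default group. Here is a minimal sketch with made-up values (the 9 is a hypothetical “no answer” code, not something in our SEX variable):

```r
library(dplyr)

sex_codes <- c(1, 2, 9, NA)   # Made-up values for illustration only

recoded_cw <- case_when(
  sex_codes == 1 ~ "Male",
  sex_codes == 2 ~ "Female"
)
recoded_cw
# "Male" "Female" NA NA
```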

Using case_when() with Three or More Groups

As we just saw, case_when() is little more than a string of ifelse() calls.

We can do the same thing as case_when() using ifelse(), but it gets tricky with more than two levels. For example, let’s recode RACE to give it text values.

gss_bootcamp <- gss_bootcamp %>% 
  mutate(race3_ifelse = ifelse(RACE == 1, "White",       # If race is 1, name it White
                               ifelse(RACE==2, "Black",  # If it isn't 1, now test: If race is 2, name it Black
                                      "Other")))         # If it isn't 2, now name it Other 

table(gss_bootcamp$race3_ifelse, gss_bootcamp$RACE, useNA = "ifany")
##        
##            1    2    3
##   Black    0  385    0
##   Other    0    0  270
##   White 1693    0    0

It totally works, but is kinda hard to follow. Here’s the same thing using case_when().

gss_bootcamp <- gss_bootcamp %>% 
  mutate(race3 = case_when(
    RACE == 1 ~ "White",
    RACE == 2 ~ "Black",
    RACE == 3 ~ "Other"))

table(gss_bootcamp$race3, gss_bootcamp$RACE, useNA = "ifany")
##        
##            1    2    3
##   Black    0  385    0
##   Other    0    0  270
##   White 1693    0    0

With this, it is easier to follow what the conditions and output values are.

Recoding Categorical Variables

Congrats! You now know the basics of data management!

Now, we’re going to do a lot of practice on this. We’re going to do a few variables together, then have a big chunk of time to work on this separately.

Earlier, I knew the values of SEX and RACE off the top of my head because I’ve worked with them a lot. But the GSS has hundreds of variables, and our small one still has 30. We don’t have to remember the values; instead, we can turn to the codebook for more detail. A codebook is a listing of all the variables that says what the values they give in the dataset actually mean.

The first variable we’re going to work with is HAPPY. Let’s find the variable in GSS’s online codebook to get started.

  1. Go to https://gssdataexplorer.norc.org/

  2. Click “SEARCH VARIABLES” (no account required)

  3. Type “happy” into the search bar and click it when it pops up

We see there that the data is coded as follows:

Code  Label
1     Very happy
2     Pretty happy
3     Not too happy
8     Don’t know
9     No answer
0     Not applicable


Let’s create a ‘text’ version of this variable that uses the labels instead of the codes. We can use case_when() to do this. In the spirit of making our names short but descriptive, and not overwriting anything, let’s call it happy_text.

gss_bootcamp <- gss_bootcamp %>% 
  mutate(happy_text = case_when(
    HAPPY == 1 ~ "Very Happy",
    HAPPY == 2 ~ "Pretty Happy",
    HAPPY == 3 ~ "Not too Happy",
    HAPPY > 3  ~ NA_character_           # Since there aren't any 0's, we don't need to add a line for 0's
  ))

After coding, let’s create a table of the old and new variables to compare. Remember to tell R to show the missings, too, so we can make sure they were properly coded.

table(gss_bootcamp$HAPPY, gss_bootcamp$happy_text, useNA = "always")
##       
##        Not too Happy Pretty Happy Very Happy <NA>
##   1                0            0        701    0
##   2                0         1307          0    0
##   3              336            0          0    0
##   8                0            0          0    4
##   <NA>             0            0          0    0

Great, it all works as we expected!

Recoding Party ID

Let’s repeat the previous procedure for PARTYID. We see the following coding:

Code  Label
0     Strong democrat
1     Not str democrat
2     Ind, near dem
3     Independent
4     Ind, near rep
5     Not str republican
6     Strong republican
7     Other party
8     Don’t know
9     No answer


Let’s condense this down to three categories, plus a missing group:

  1. Democrats (0-2)

  2. Republicans (4-6)

  3. Independents / Other Party (3 & 7)

  4. Don’t Know / No Answer (8 & 9) - set as missing


Because the codes don’t line up easily with our categories, we can use the %in% operator to help us out.
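As a quick sketch of what %in% does: it asks, for each value on the left, whether it appears anywhere in the set on the right. The vector party_codes below is made up for illustration.

```r
party_codes <- c(0, 3, 7, 9, NA)   # Made-up values for illustration only

in_set <- party_codes %in% c(3, 7)
in_set
# FALSE  TRUE  TRUE FALSE FALSE

# == compares against a single value and returns NA for missings
equals_three <- party_codes == 3
equals_three
# FALSE  TRUE FALSE FALSE    NA
```

Unlike ==, %in% never returns NA, so a missing value simply counts as “not in the set”.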

Let’s first create a set of vectors that contain the values for each category.

democratcodes <- c(0,1,2)
republicancodes <- c(4,5,6)
independentcodes <- c(3,7)
dknacodes <- c(8,9)

Then, using our %in% operator, we can recode PARTYID a little more easily.

gss_bootcamp <- gss_bootcamp %>% 
  mutate(party3 = case_when(                      # NOTE that I created a new variable instead of overwriting the old one
    PARTYID %in% democratcodes ~ "Democrat",
    PARTYID %in% republicancodes ~ "Republican",
    PARTYID %in% independentcodes ~ "Independent",
    PARTYID %in% dknacodes ~ NA_character_,
  ))

table(gss_bootcamp$PARTYID, gss_bootcamp$party3, useNA = "always")   # We can use the table feature to make sure everybody is where they're supposed to be
##       
##        Democrat Independent Republican <NA>
##   0         379           0          0    0
##   1         352           0          0    0
##   2         307           0          0    0
##   3           0         414          0    0
##   4           0           0        259    0
##   5           0           0        272    0
##   6           0           0        255    0
##   7           0          77          0    0
##   9           0           0          0   33
##   <NA>        0           0          0    0

Recoding Numeric Variables

Some of our variables, like AGE and TVHOURS are numeric. Let’s take a quick look at them:

table(gss_bootcamp$AGE)
## 
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 
## 22 26 15 27 40 29 38 43 31 39 45 43 50 34 43 42 65 40 40 43 38 55 41 40 40 39 
## 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 
## 40 29 42 30 33 31 37 41 29 48 35 48 50 39 37 46 46 39 24 37 33 39 36 27 39 41 
## 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 99 
## 45 33 22 20 29 23 19 28 11  9 12  9 12  7 10 12 14  5  8 29  7
typeof(gss_bootcamp$AGE)
## [1] "integer"
table(gss_bootcamp$TVHOURS)
## 
##  -1   0   1   2   3   4   5   6   7   8   9  10  12  14  15  16  17  18  20  24 
## 789 145 349 376 240 169 100  63  13  38   6  23  14   2   4   3   1   1   4   4 
##  98  99 
##   3   1
typeof(gss_bootcamp$TVHOURS)
## [1] "integer"

Yep, they sure are numeric. (Or, at least integer.) But if we look at the end of each table and at the codebook, we can see there are some weird things happening.


For AGE:

Code  Label
89    89 or older
98    Don’t Know
99    No Answer


And for TVHOURS:

Code  Label
-1    Not Applicable
98    Don’t Know
99    No Answer


Let’s code all of these as missing and keep the values for everything else. Also, because typeof(gss_bootcamp$AGE) and typeof(gss_bootcamp$TVHOURS) are both integer, we use NA_integer_ here instead of plain NA. This is a quirk of case_when(), which (at least in the dplyr version loaded above, 1.0.8) requires every output value to have the same type.

gss_bootcamp <- gss_bootcamp %>% 
  mutate(
    newage = case_when(            # You can't go wrong with just calling a new variable "newvariable"
      AGE >= 89 ~ NA_integer_,
      AGE <= 88 ~ AGE
      ),
    newtvhours = case_when(
      TVHOURS >=98 ~ NA_integer_,
      TVHOURS == -1 ~ NA_integer_,
      TRUE ~ TVHOURS
      )
  )

table(gss_bootcamp$newage, useNA = "ifany")
## 
##   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33 
##   22   26   15   27   40   29   38   43   31   39   45   43   50   34   43   42 
##   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49 
##   65   40   40   43   38   55   41   40   40   39   40   29   42   30   33   31 
##   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
##   37   41   29   48   35   48   50   39   37   46   46   39   24   37   33   39 
##   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80   81 
##   36   27   39   41   45   33   22   20   29   23   19   28   11    9   12    9 
##   82   83   84   85   86   87   88 <NA> 
##   12    7   10   12   14    5    8   36
table(gss_bootcamp$newtvhours, gss_bootcamp$TVHOURS, useNA = "ifany")
##       
##         -1   0   1   2   3   4   5   6   7   8   9  10  12  14  15  16  17  18
##   0      0 145   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   1      0   0 349   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   2      0   0   0 376   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   3      0   0   0   0 240   0   0   0   0   0   0   0   0   0   0   0   0   0
##   4      0   0   0   0   0 169   0   0   0   0   0   0   0   0   0   0   0   0
##   5      0   0   0   0   0   0 100   0   0   0   0   0   0   0   0   0   0   0
##   6      0   0   0   0   0   0   0  63   0   0   0   0   0   0   0   0   0   0
##   7      0   0   0   0   0   0   0   0  13   0   0   0   0   0   0   0   0   0
##   8      0   0   0   0   0   0   0   0   0  38   0   0   0   0   0   0   0   0
##   9      0   0   0   0   0   0   0   0   0   0   6   0   0   0   0   0   0   0
##   10     0   0   0   0   0   0   0   0   0   0   0  23   0   0   0   0   0   0
##   12     0   0   0   0   0   0   0   0   0   0   0   0  14   0   0   0   0   0
##   14     0   0   0   0   0   0   0   0   0   0   0   0   0   2   0   0   0   0
##   15     0   0   0   0   0   0   0   0   0   0   0   0   0   0   4   0   0   0
##   16     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3   0   0
##   17     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
##   18     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
##   20     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   24     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   <NA> 789   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##       
##         20  24  98  99
##   0      0   0   0   0
##   1      0   0   0   0
##   2      0   0   0   0
##   3      0   0   0   0
##   4      0   0   0   0
##   5      0   0   0   0
##   6      0   0   0   0
##   7      0   0   0   0
##   8      0   0   0   0
##   9      0   0   0   0
##   10     0   0   0   0
##   12     0   0   0   0
##   14     0   0   0   0
##   15     0   0   0   0
##   16     0   0   0   0
##   17     0   0   0   0
##   18     0   0   0   0
##   20     4   0   0   0
##   24     0   4   0   0
##   <NA>   0   0   3   1

This diagonal line means it’s working: 0 goes to 0, 9 goes to 9, etc. But the ones we coded to NA (-1, 98, 99) are now <NA>.

In the last line for newtvhours, I add a final statement TRUE ~ TVHOURS. This means “everything left over” that hasn’t already been coded. It is very handy when we’re working with multiple conditions as it takes everything that hasn’t been previously included. Just be careful, though, because you might not want everything that’s left.
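The reason to put TRUE last is that case_when() works through the conditions in order and uses the first one that matches. Here is a small sketch with made-up integer values (tv, recoded_last, and recoded_first are names invented for this example):

```r
library(dplyr)

tv <- c(-1L, 0L, 5L, 98L, 99L)   # Made-up integer values for illustration

# Specific conditions first, catch-all last
recoded_last <- case_when(
  tv >= 98L ~ NA_integer_,
  tv == -1L ~ NA_integer_,
  TRUE      ~ tv
)
recoded_last
# NA  0  5 NA NA

# Catch-all first: every value matches TRUE right away, so nothing gets recoded
recoded_first <- case_when(
  TRUE      ~ tv,
  tv >= 98L ~ NA_integer_,
  tv == -1L ~ NA_integer_
)
recoded_first
# -1  0  5 98 99
```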

Official Question Time 3

Since the last OQT, we’ve done:

  1. Recoding with ifelse()

  2. Recoding with case_when()

  3. Practicing Data Management with Categorical Data

    • SEX –> sex_text

    • RACE –> race3

    • HAPPY –> happy_text

    • PARTYID –> party3

  4. Practicing Data Management With Numeric Data

    • AGE –> newage

    • TVHOURS –> newtvhours

Coding Practice

Let’s take some time to code our data.

Using the online codebook, code the following variables:

  1. INCOME

    • Recode 0, 98, and 99 to missing

    • Everything else keep the same

  2. EDUC

    • Recode 98 and 99 to missing

    • Everything else keep the same

  3. POLVIEWS

    • Code as liberal, moderate, conservative

    • Don’t know, No Answer, and Not Applicable code to missing

  4. Pick three of the national spending variables (starts with “nat”)

    • Don’t know, No Answer, and Not Applicable code to missing

The next slide has snippets of code for if/when you get stuck.

Code Snippets for Selected Variables

gss_bootcamp <- gss_bootcamp %>% 
  mutate(
    #### Sample Recode for income Variable
        newincome = case_when(
          INCOME == 0 ~ NA_integer_,
          INCOME >= 98 ~ NA_integer_,
          TRUE ~ INCOME
          ),

    #### Sample Recode for education Variable
        education = case_when(
          EDUC %in% c(-1, 98, 99) ~ NA_integer_,
          TRUE ~ EDUC
          ),

    #### Sample Recode for polviews Variable
        newpolviews = case_when(
          POLVIEWS %in% c(1, 2, 3) ~ "Liberal",
          POLVIEWS %in% c(5, 6, 7) ~ "Conservative",
          POLVIEWS %in% 4 ~ "Moderate",
          POLVIEWS %in% c(0, 8, 9) ~ NA_character_
          ),
    
    #### Sample Recode for spending Variable
        newnatenvir = case_when(
          NATENVIR %in% c(8, 9, 0) ~ NA_integer_,
          TRUE ~ NATENVIR
          ))
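Several of these recodes follow the same pattern: set a handful of codes to missing and keep everything else. As an alternative sketch, that pattern can be wrapped in a small base R helper. Note that codes_to_na() is a hypothetical function written for this example, not part of any package, and the income values are made up.

```r
# Hypothetical helper: set the listed codes to NA and keep everything else
codes_to_na <- function(x, codes) {
  replace(x, x %in% codes, NA)
}

income <- c(0L, 5L, 12L, 98L, 99L)   # Made-up integer values for illustration
newincome <- codes_to_na(income, c(0, 98, 99))

newincome           # NA  5 12 NA NA
typeof(newincome)   # "integer", so the storage type is preserved
```

Inside mutate(), this could be used as newincome = codes_to_na(INCOME, c(0, 98, 99)). The case_when() versions above remain the better choice when different codes need different labels.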

Official Question Time 5

Since the last OQT, we’ve done:

  1. Practicing coding variables

Wrapping Up

One Last Practice

For one last practice, complete the following steps:

  1. GSS Recoding

    1. Load the GSS data from the file

    2. From “GSS,” create a dataset “newgss” that includes the following variables

      1. Race

      2. Sex

      3. Age

      4. Self-ranked social position (“RANK”)

      5. Occupational Prestige

      6. Spouse’s Occupational Prestige (“SPPRES10”)

      7. Frequency of Prayer (“PRAY”)

    3. Recode data and missing data appropriately for the analyses in Parts 3 and 4

    4. Remove anybody with missing values

  2. Show how many rows and columns are in newgss

  3. From newgss, create a new dataset of only those age 50 and up and show how many are in it

  4. From newgss, create a new dataset of high-status people who pray often (however you choose to define it), then show how many are in it

Answers

load("~/R/ay23_lab_ta/gss_files/rawgss.RData")

# Part 1.1: Load GSS and name it newgss 
newgss <- GSS %>% 
  
  # Part 1.2: Variable selection
  select(RACE, SEX, AGE, RANK, 
         PRESTG10, SPPRES10, PRAY) %>% 
  
  # Part 1.3: Recoding Data (Including Missing Data) 
  mutate(
    newage = case_when(
      AGE >= 89 ~ NA_integer_,
      AGE <= 88 ~ AGE
      ),
    socpos = case_when(
      RANK %in% c(0, 98, 99) ~ NA_integer_,
      TRUE ~ RANK 
      ),
    prayer_freq = case_when(
      PRAY == 1 ~ "Several Times/Day",
      PRAY == 2 ~ "1ce/Day",
      PRAY == 3 ~ "Several Times/Week",
      PRAY == 4 ~ "1ce/Week",
      PRAY == 5 ~ "< 1ce/Week",
      PRAY == 6 ~ "Never",
      PRAY %in% c(0, 8, 9) ~ NA_character_
      )
    ) %>% 
  rename(       # Not necessary but makes it easier to read 
    race = RACE,
    sex = SEX,
    prestige = PRESTG10,
    spouse_prestige = SPPRES10
  ) %>% 
  
  # Part 1.4: Removing data with missing values 
  drop_na()

# Part 2: Count of newgss
dim(newgss) 
## [1] 2216   10
# Part 3: Dataset of age >= 50
gss_old <- newgss %>% 
  filter(newage >= 50)
nrow(gss_old) # One way to show how many are there
## [1] 1044
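# Part 4: High status + pray often 
richprayers <- newgss %>% 
    # Social position is greater than 7/10 & prays at least once/day.
    # prayer_freq holds text, so `prayer_freq < 3` compares alphabetically and also keeps
    # labels like "1ce/Week" (see the PRAY values of 4 and 5 in the printout below).
    # Filtering on the numeric PRAY codes (1 = several times/day, 2 = once/day) avoids this.
  filter(socpos > 7 & PRAY < 3) 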

str(richprayers)  # Another way to show how many are there 
## 'data.frame':    61 obs. of  10 variables:
##  $ race           : int  1 1 1 3 1 2 1 2 2 2 ...
##  $ sex            : int  2 1 2 1 2 1 2 1 2 1 ...
##  $ AGE            : int  61 52 42 53 84 83 83 86 34 59 ...
##  $ RANK           : int  10 8 9 10 8 10 10 10 9 8 ...
##  $ prestige       : int  45 42 52 49 64 27 32 40 35 35 ...
##  $ spouse_prestige: int  0 0 31 0 0 36 0 0 0 0 ...
##  $ PRAY           : int  2 4 2 2 5 2 2 2 2 2 ...
##  $ newage         : int  61 52 42 53 84 83 83 86 34 59 ...
##  $ socpos         : int  10 8 9 10 8 10 10 10 9 8 ...
##  $ prayer_freq    : chr  "1ce/Day" "1ce/Week" "1ce/Day" "1ce/Day" ...
##  - attr(*, "col.label")= chr [1:89] "Rs religious preference" "Favor preference in hiring blacks" "Blacks overcome prejudice without favors " "How close feel to blacks  " ...

Official Question Time 6

So far today, we’ve learned

  1. Missing Data

  2. Workspaces and Projects

  3. Importing Data

  4. Coding and Recoding Data

    1. Categorical

    2. Numeric

    3. Missing