Fred Traylor, Lab TA (he/him)
September 2, 2022
Missing Data
Importing Data
Cleaning Data
Editing Variable Values
Creating New Variables
Organically in your data
Coding error
Respondents
Refusal
Don’t Know
Inapplicability
Let’s create some purposefully so we can look at it.
We can use the `is.na()` function to see which values in our vector are missing.
We can `sum` them to count how many there are. (This works because `TRUE` counts as 1 and `FALSE` counts as 0.)
`table` will give us a summary of how many are `TRUE` (missing) and how many are `FALSE` (not missing).
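As a minimal sketch of the three checks just described: the vector below is a reconstruction chosen to match the output shown in this section, so treat the exact values as an assumption rather than the original chunk.

```r
# Hypothetical reconstruction of the vector; values inferred from the printed output
missvec <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)

is.na(missvec)         # TRUE wherever a value is missing
sum(is.na(missvec))    # TRUE counts as 1, so this counts the missings
table(is.na(missvec))  # how many are FALSE (present) and TRUE (missing)
```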
## [1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
## [1] 2
##
## FALSE TRUE
## 7 2
When we have missing data, arithmetic applied to a vector is calculated for every value except the missing ones, which stay `NA`:
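The numbers printed below are consistent with multiplying the vector by 3; that is an inference from the output, not the original chunk. The sketch also shows that summary functions such as `mean()` return `NA` unless told to skip missings with `na.rm = TRUE`.

```r
# Hypothetical reconstruction of the vector used in this section
missvec <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)

missvec * 3                  # element-wise: missing values stay NA
mean(missvec)                # NA, because at least one value is missing
mean(missvec, na.rm = TRUE)  # drops the missings before averaging
```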
## [1] 135 267 21 NA 294 NA 9 135 21
When we use the `table()` function on data, we can’t see the missing values unless we include the `useNA = "ifany"` argument, like this:
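A short sketch of the two calls, using a vector reconstructed from the output (an assumption, since the original chunk is not shown):

```r
# Hypothetical reconstruction of the vector used in this section
missvec <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)

table(missvec)                   # missings are silently left out
table(missvec, useNA = "ifany")  # adds an <NA> column when missings exist
```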
## missvec
## 3 7 45 89 98
## 1 2 2 1 1
## missvec
## 3 7 45 89 98 <NA>
## 1 2 2 1 1 2
Now we can see the two missing values.
We could also set it to `useNA = "always"`, which includes a spot for missings in the table, even if there aren’t any.
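To see the difference between the settings, here is a sketch with a vector that has no missing values. The names and counts mirror the output below, but the order of the elements is made up.

```r
# Hypothetical vector with no missing values (element order is an assumption)
nomiss <- c("karen", "paul", "karen", "tom", "kristen", "paul", "quan", "karen")

table(nomiss)                    # default: no <NA> column
table(nomiss, useNA = "ifany")   # still no <NA> column, since nothing is missing
table(nomiss, useNA = "always")  # <NA> column appears with a count of 0
```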
## nomiss
## karen kristen paul quan tom
## 3 1 2 1 1
## nomiss
## karen kristen paul quan tom
## 3 1 2 1 1
## nomiss
## karen kristen paul quan tom <NA>
## 3 1 2 1 1 0
Remember how we have different types of data:
Numeric
Double: number with something after the decimal point
Integer: number without something after the decimal (e.g. 6L, 199L)
Created by typing “L” after the number
Requires less storage than double
Character
Logical
There are also different types of missing values. Right now, we’re only
going to use four:
`NA`: the default missing; how missing shows up in a table
`NA_character_` is what you must use to assign character data as missing
`NA_integer_` is what you must use to assign integer data as missing
`NA_real_` is what you must use to assign double data as missing
Note that the last three types of missings have an underscore (`_`) at the end of them.
The others might come up this next year, but we see them so rarely
that we won’t use them now.
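As a quick check of how these four missings differ, `typeof()` reports the storage type of each one. This is plain base R, with no assumptions beyond the names listed above.

```r
typeof(NA)             # "logical": the default missing
typeof(NA_character_)  # "character"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
```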
When in doubt, run `typeof()` to see how it is stored.
a <- 1 # A number without L becomes double
b <- 3L # Integers are created by putting a capital L after the number
c <- "word"
d <- TRUE
typeof(a)
## [1] "double"
## [1] "integer"
## [1] "character"
## [1] "logical"
## [1] "double"
You might remember from yesterday that we used a function `class()`. These are very similar, but `typeof()` will tell us if it is integer or double, making it more helpful in determining if we need to use `NA` or `NA_integer_`.
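A side-by-side sketch of the two functions on a double and an integer, matching the four results printed below:

```r
a <- 1   # a number without L is stored as double
b <- 3L  # a capital L makes it an integer

typeof(a)  # "double"
class(a)   # "numeric": class() does not distinguish double from integer here
typeof(b)  # "integer"
class(b)   # "integer"
```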
## [1] "double"
## [1] "numeric"
## [1] "integer"
## [1] "integer"
Let’s create some missing data:
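The vector below is a reconstruction from the printed output, so which positions used `NA` versus `NA_character_` is an assumption. It mixes both kinds of missing with the quoted text `"NA"`.

```r
# Hypothetical reconstruction: two real missings plus the literal text "NA"
misstext <- c("Blue", "Green", "Orange", NA, "Red",
              "Blue", "Orange", NA_character_, "NA")

misstext
table(misstext, useNA = "ifany")  # the text "NA" and the missing <NA> are counted separately
```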
## [1] "Blue" "Green" "Orange" NA "Red" "Blue" "Orange" NA
## [9] "NA"
## misstext
## Blue Green NA Orange Red <NA>
## 2 1 1 2 1 2
Note here that it didn’t matter whether we used `NA` or `NA_character_` when we created the vector, as both register as `<NA>` in the table.
One value labeled `"NA"` was kept as text, though. Because we used quotation marks around it when we created the vector, R thought we wanted it as text and left it undisturbed.
Now, let’s combine our two vectors into a tibble.
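A minimal sketch of the combination step, assuming the two vectors reconstructed from the outputs in this section. The column names `numbers` and `colors` come from the printed tibble.

```r
library(tidyverse)

# Hypothetical reconstructions of the two vectors
missvec  <- c(45, 89, 7, NA, 98, NA, 3, 45, 7)
misstext <- c("Blue", "Green", "Orange", NA, "Red",
              "Blue", "Orange", NA_character_, "NA")

# One column per vector; both vectors must have the same length
misstibble <- tibble(numbers = missvec, colors = misstext)
misstibble
```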
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## # A tibble: 9 x 2
## numbers colors
## <dbl> <chr>
## 1 45 Blue
## 2 89 Green
## 3 7 Orange
## 4 NA <NA>
## 5 98 Red
## 6 NA Blue
## 7 3 Orange
## 8 45 <NA>
## 9 7 NA
We can see that some rows aren’t missing anything, while others are missing one or even two values.
If we want to keep only the rows that have no missing data, we can use the `drop_na()` function.
You might also see `na.omit()` used. It is similar to `drop_na()`, but `drop_na()` is more flexible because it can be limited to specific columns.
`drop_na()` can also be piped.
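Both forms can be sketched as follows, assuming a tibble rebuilt from the values printed earlier. Note that the row holding the text `"NA"` survives, because it is not actually missing.

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the tibble from this section
misstibble <- tibble(
  numbers = c(45, 89, 7, NA, 98, NA, 3, 45, 7),
  colors  = c("Blue", "Green", "Orange", NA, "Red", "Blue", "Orange", NA, "NA")
)

drop_na(misstibble)       # function-call form
misstibble %>% drop_na()  # piped form, same result
```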
## # A tibble: 6 x 2
## numbers colors
## <dbl> <chr>
## 1 45 Blue
## 2 89 Green
## 3 7 Orange
## 4 98 Red
## 5 3 Orange
## 6 7 NA
## # A tibble: 6 x 2
## numbers colors
## <dbl> <chr>
## 1 45 Blue
## 2 89 Green
## 3 7 Orange
## 4 98 Red
## 5 3 Orange
## 6 7 NA
`drop_na()` also has the ability to remove observations that are missing on specific columns.
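The three results below are consistent with calls like these. The tibble is a reconstruction and the exact calls are inferred from the row counts, so read this as a sketch.

```r
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of the tibble from this section
misstibble <- tibble(
  numbers = c(45, 89, 7, NA, 98, NA, 3, 45, 7),
  colors  = c("Blue", "Green", "Orange", NA, "Red", "Blue", "Orange", NA, "NA")
)

misstibble %>% drop_na(numbers, colors)  # drop rows missing on either column
misstibble %>% drop_na(numbers)          # drop only rows missing on numbers
misstibble %>% drop_na(colors)           # drop only rows missing on colors
```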
## # A tibble: 6 x 2
## numbers colors
## <dbl> <chr>
## 1 45 Blue
## 2 89 Green
## 3 7 Orange
## 4 98 Red
## 5 3 Orange
## 6 7 NA
## # A tibble: 7 x 2
## numbers colors
## <dbl> <chr>
## 1 45 Blue
## 2 89 Green
## 3 7 Orange
## 4 98 Red
## 5 3 Orange
## 6 45 <NA>
## 7 7 NA
## # A tibble: 7 x 2
## numbers colors
## <dbl> <chr>
## 1 45 Blue
## 2 89 Green
## 3 7 Orange
## 4 98 Red
## 5 NA Blue
## 6 3 Orange
## 7 7 NA
This is useful where there is a lot of missingness throughout the dataset and we don’t want to remove observations that are missing on variables we don’t care about.
Since we started today, we’ve done:
Missing data
How to create it
Types of it
NA
NA_character_
NA_real_
NA_integer_
How to detect it
is.na()
How to remove it
drop_na()
We’re about to change gears a little and won’t need these objects we just saved. Let’s remove them from the global environment so they won’t clutter up our workspace.
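A small sketch of listing and removing objects. The object names here are made up for illustration; in the session above you would name the objects shown by `ls()`.

```r
tmp_a <- 1   # hypothetical stand-ins for objects created earlier
tmp_b <- 3L

ls()               # list the objects in the current environment
rm(tmp_a, tmp_b)   # remove specific objects by name
# rm(list = ls())  # removes everything at once; use with care
```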
## [1] "a" "b" "c" "d" "misstext"
## [6] "misstibble" "missvec" "nomiss"
## character(0)
This morning, I sent you a file containing data from the General Social Survey (GSS).
Go ahead and save that into your working directory, then click on it to import it into your Global Environment.
## [1] "C:/Users/fhtra/Documents/R/ay23_lab_ta/bootcamp"
Using our imported GSS data, and what we’ve learned so far on data management, let’s create some variables.
Looking at the size of the dataset (`dim(GSS)` = 2348, 89), we can see that there are way more variables than we want right now. Let’s create a new dataset “gss_bootcamp”, and select only a few of the variables we want.
Demographic factors
ID_ (Respondent’s ID)
AGE (age in years)
EDUC (years of education)
Political Variables
PARTYID (Party Identification)
POLVIEWS (Political Ideology)
Economic Variables
PRESTG10 (occupational prestige score)
INCOME (annual income in categories)
TVHOURS (hours of TV watched per day)
Quality of Life
HAPPY (General Happiness)
HEALTH (General Health)
We also want a set of variables asking what they think about government spending on a series of factors. These all start with “NAT” and end with a descriptor of the topic.
Go ahead and try this yourself, then go to the next slide to see how I did it.
library(tidyverse)
gss_bootcamp <- GSS %>%
as_tibble() %>%
select(ID_, AGE, EDUC, # Demographics
PARTYID, POLVIEWS, # Political Variables
PRESTG10, INCOME, TVHOURS, # Economic Variables
HAPPY, HEALTH, # Quality of Life
starts_with("NAT") # National Spending
)
Now, the dataset is much smaller with only 28 variables.
The Beauty of Source Code
Actually, let’s add sex and race to the dataset too.
If we didn’t do this as source, we’d have to entirely retype the previous section of code.
But instead, we can simply go back and add “SEX” and “RACE” to our code and rerun the section.
gss_bootcamp <- GSS %>%
as_tibble() %>%
select(ID_, AGE, EDUC, SEX, RACE, # Demographics
PARTYID, POLVIEWS, # Political Variables
PRESTG10, INCOME, TVHOURS, # Economic Variables
HAPPY, HEALTH, # Quality of Life
starts_with("NAT") # National Spending
)
This is why it’s also important not to overwrite the original file (“GSS”). Because `GSS` is kept unchanged the entire time, we can simply rerun the code without having to load up the dataset from our files again.
And now, let’s take a look at our dataset with the `View(gss_bootcamp)` function to see it in our source pane, or `print(gss_bootcamp)` to see it in the console. Go ahead and pick one (or both).
## # A tibble: 2,348 x 30
## ID_ AGE EDUC SEX RACE PARTYID POLVIEWS PRESTG10 INCOME TVHOURS HAPPY
## <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 1 43 14 1 1 5 6 47 13 3 2
## 2 2 74 10 2 1 2 8 22 12 -1 1
## 3 3 42 16 1 1 4 5 61 12 1 1
## 4 4 63 16 2 1 2 4 59 13 1 1
## 5 5 71 18 1 2 6 7 53 13 -1 2
## 6 6 67 16 2 1 2 3 53 98 10 3
## 7 7 59 13 2 2 0 4 48 10 -1 2
## 8 8 43 12 1 1 5 5 35 12 -1 2
## 9 9 62 8 2 1 3 4 35 5 4 3
## 10 10 55 12 1 1 1 8 39 12 2 2
## # ... with 2,338 more rows, and 19 more variables: HEALTH <int>, NATFARE <int>,
## # NATROAD <int>, NATSOC <int>, NATMASS <int>, NATPARK <int>, NATCHLD <int>,
## # NATSCI <int>, NATENRGY <int>, NATAID <int>, NATARMS <int>, NATSPAC <int>,
## # NATENVIR <int>, NATHEAL <int>, NATCITY <int>, NATCRIME <int>,
## # NATDRUG <int>, NATEDUC <int>, NATRACE <int>
We see that all we have are numbers. Next up: how to make these numbers cleaner for us to work with.
Since the last OQT, we’ve done:
Creating a new dataset from our old one
Selecting variables of interest
Let’s start by adding a variable onto our dataset.
The tidyverse package `dplyr` gives us a great tool for mutating our variables. It’s called, aptly enough, `mutate()`.
It takes the form `mutate(variable = value)`.
Let’s say we want to add a column to our `gss_bootcamp` where every value is 123456. We can do that with:
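A minimal sketch of that step on a tiny stand-in dataset (`toy` is hypothetical). With the real data you would start from `gss_bootcamp` and assign the result back to it.

```r
library(dplyr)

toy <- tibble(ID_ = 1:3)  # hypothetical stand-in for gss_bootcamp

toy <- toy %>%
  mutate(numbers = 123456)  # the single value is recycled to every row

toy %>% select(numbers)
```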
## # A tibble: 2,348 x 1
## numbers
## <dbl>
## 1 123456
## 2 123456
## 3 123456
## 4 123456
## 5 123456
## 6 123456
## 7 123456
## 8 123456
## 9 123456
## 10 123456
## # ... with 2,338 more rows
We told it that the column “numbers” should contain 123456. Because the column didn’t exist, it added it onto our existing dataset.
We can also use mutate to reference columns within the same dataset, or even the column itself. For example:
gss_bootcamp %>%
mutate(newnumbers = numbers,
numbers = numbers - 100000,
bignumbers = numbers * 2) %>%
select(numbers, newnumbers, bignumbers)
## # A tibble: 2,348 x 3
## numbers newnumbers bignumbers
## <dbl> <dbl> <dbl>
## 1 23456 123456 46912
## 2 23456 123456 46912
## 3 23456 123456 46912
## 4 23456 123456 46912
## 5 23456 123456 46912
## 6 23456 123456 46912
## 7 23456 123456 46912
## 8 23456 123456 46912
## 9 23456 123456 46912
## 10 23456 123456 46912
## # ... with 2,338 more rows
We gave it our dataset, piped it down, and told R to mutate three columns:
`newnumbers`, a new column, should take the value of `numbers`,
`numbers`, an existing column, should take the value `numbers - 100000`,
`bignumbers`, a new column, should take the value `numbers * 2`.
Note that `bignumbers` uses the updated value of `numbers`, not the original value, since it came after we changed the value.
ifelse()
One of the most basic computing functions is `ifelse()`.
It takes the form `ifelse(test, yes, no)`. In other words, “if the data passes the test with TRUE, follow the yes condition; else, follow the no condition.”
To illustrate, let’s make a small vector:
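The output below is consistent with a vector running from 1 to 7 and the test described in the steps that follow. The exact definition of `smallvec` is inferred from the results, so treat it as an assumption.

```r
smallvec <- 1:7  # hypothetical reconstruction of the small vector

ifelse(smallvec < 4, 1, 0)  # 1 where the value is below 4, otherwise 0
```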
## [1] 1 1 1 0 0 0 0
Here’s what it did:
Take each value in smallvec
Test if the value is less than 4
If TRUE, return 1
If FALSE, return 0
We can also do this and make it return whatever we want. For example:
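Three variations that match the outputs below, again assuming `smallvec` runs from 1 to 7. The last one shows that `yes` and `no` can be computed from the vector itself.

```r
smallvec <- 1:7  # hypothetical reconstruction of the small vector

ifelse(smallvec < 4, "Small", "Big")        # return text labels
ifelse(smallvec < 4, 46.8, 717409)          # return any two numbers
ifelse(smallvec < 4, smallvec, smallvec^3)  # keep small values, cube the rest
```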
## [1] "Small" "Small" "Small" "Big" "Big" "Big" "Big"
## [1] 46.8 46.8 46.8 717409.0 717409.0 717409.0 717409.0
## [1] 1 2 3 64 125 216 343
So now let’s recode our variable. In the GSS, `SEX` is coded `1 = male`, `2 = female`, but we only have the numbers in our dataset. We can use our `ifelse()` function to give these values text names. Let’s create a new variable called `sex_text` that has text values for the variable `SEX`.
After recoding like this, I like to go back and make a `table()` to make sure my code did what I wanted.
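A sketch of the recode and the check on a few made-up rows (`toy` is hypothetical). On the real data the same `mutate()` would be applied to `gss_bootcamp`.

```r
library(dplyr)

toy <- tibble(SEX = c(1L, 2L, 2L, 1L, 2L))  # hypothetical stand-in for gss_bootcamp

toy <- toy %>%
  mutate(sex_text = ifelse(SEX == 1, "Male", "Female"))  # 1 becomes Male, anything else Female

# Cross-tabulate new against old: each code should land in exactly one label
table(toy$sex_text, toy$SEX, useNA = "ifany")
```

One design caveat: because `ifelse()` only has a yes branch and a no branch, any value other than 1 that is not missing would be labeled "Female" here, which is one reason to check the table afterward.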
##
## 1 2
## Female 0 1296
## Male 1052 0
case_when()
We can also do this more explicitly using `case_when()`, a function from the tidyverse. `case_when()` lets us do variable management with logical tests in a way that is easy to follow and understand.
It takes the form
dataset %>% # Dataset
mutate(variable = case_when( # Mutate a variable depending on a case
condition ~ value,
condition ~ value
))
What does this mean?
We start with our dataset, as always.
Then, we tell R to `mutate` and give it our variable.
Then we say, “actually, instead of applying the same value (or value calculation) for everybody, R should assign the value depending on a condition”
Lastly, we close our parentheses
So we can code sex again in a different way, this time using `case_when()`.
gss_bootcamp <- gss_bootcamp %>%
mutate(sex_text_cw = case_when(
SEX == 1 ~ "Male", # If SEX == 1, then assign the value "Male"
SEX == 2 ~ "Female", # If SEX == 2, then assign the value "Female"
))
See what we did there?
Start with our data and pipe it down
`mutate` our data so that `sex_text_cw` takes a value dependent on the following conditions
If `SEX == 1`, then assign the value “Male”
If `SEX == 2`, then assign the value “Female”
Lastly, let’s create two tables showing that the coding worked exactly as we wanted and there are no missing values.
table(gss_bootcamp$sex_text_cw, gss_bootcamp$SEX, useNA = "ifany") # Checking our work with the original variable
##
## 1 2
## Female 0 1296
## Male 1052 0
table(gss_bootcamp$sex_text_cw, gss_bootcamp$sex_text, useNA = "ifany") # Checking our work with the ifelse variable from earlier
##
## Female Male
## Female 1296 0
## Male 0 1052
case_when() with Three or More Groups
As we just saw, `case_when()` is little more than a string of `ifelse()` calls.
We can do the same thing as `case_when()` using `ifelse()`, but it gets tricky with more than two levels. For example, let’s recode RACE to give it text values.
gss_bootcamp <- gss_bootcamp %>%
mutate(race3_ifelse = ifelse(RACE == 1, "White", # If race is 1, name it White
ifelse(RACE==2, "Black", # If it isn't 1, now test: If race is 2, name it Black
"Other"))) # If it isn't 2, now name it Other
table(gss_bootcamp$race3_ifelse, gss_bootcamp$RACE, useNA = "ifany")
##
## 1 2 3
## Black 0 385 0
## Other 0 0 270
## White 1693 0 0
It totally works, but is kinda hard to follow. Here’s the same thing using `case_when()`.
gss_bootcamp <- gss_bootcamp %>%
mutate(race3 = case_when(
RACE == 1 ~ "White",
RACE == 2 ~ "Black",
RACE == 3 ~ "Other"))
table(gss_bootcamp$race3, gss_bootcamp$RACE, useNA = "ifany")
##
## 1 2 3
## Black 0 385 0
## Other 0 0 270
## White 1693 0 0
With this, it is easier to follow what the conditions and output values are.
Congrats! You now know the basics of data management!
Now, we’re going to do a lot of practice on this. We’re going to do a few variables together, then have a big chunk of time to work on this separately.
Earlier, I knew the values of SEX and RACE off the top of my head because I’ve worked with them a lot. But the GSS has hundreds of variables, and our small one still has 30. We don’t have to remember the values; instead, we can turn to the codebook for more detail. A codebook is a listing of all the variables that says what the values they give in the dataset actually mean.
The first variable we’re going to work with is HAPPY. Let’s find the variable in GSS’s online codebook to get started.
Click “SEARCH VARIABLES” (no account required)
Type “happy” into the search bar and click it when it pops up
We see there that the data is coded as follows:
Code | Label |
---|---|
1 | Very happy |
2 | Pretty happy |
3 | Not too happy |
8 | Don’t know |
9 | No answer |
0 | Not applicable |
Let’s create a ‘text’ version of this variable that uses the labels instead of the codes. We can use `case_when()` to do this. In the spirit of making our names short but descriptive, and not overwriting anything, let’s call it `happy_text`.
gss_bootcamp <- gss_bootcamp %>%
mutate(happy_text = case_when(
HAPPY == 1 ~ "Very Happy",
HAPPY == 2 ~ "Pretty Happy",
HAPPY == 3 ~ "Not too Happy",
HAPPY > 3 ~ NA_character_ # Since there aren't any 0's, we don't need to add a line for 0's
))
After coding, let’s create a table of the old and new variables to compare. Remember to tell R to show the missings, too, so we can make sure they were properly coded.
##
## Not too Happy Pretty Happy Very Happy <NA>
## 1 0 0 701 0
## 2 0 1307 0 0
## 3 336 0 0 0
## 8 0 0 0 4
## <NA> 0 0 0 0
Great, it all works as we expected!
Let’s repeat the previous procedure for `PARTYID`. We see the following coding:
Code | Label |
---|---|
0 | Strong democrat |
1 | Not str democrat |
2 | Ind, near dem |
3 | Independent |
4 | Ind, near rep |
5 | Not str republican |
6 | Strong republican |
7 | Other party |
8 | Don’t know |
9 | No answer |
Let’s condense this down to three categories, plus missing:
Democrats (0-2)
Republicans (4-6)
Independents / Other Party (3 & 7)
Don’t Know / No Answer (8 & 9) - set as missing
Because the codes don’t line up easily with our categories, we can use the `%in%` operator to help us out.
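As a quick illustration of what `%in%` does on its own, here is a base R sketch with a made-up vector of codes. It returns one `TRUE` or `FALSE` per element, depending on whether that element appears in the set on the right.

```r
codes <- c(0, 3, 5, 7, 9)  # hypothetical PARTYID-style codes

codes %in% c(0, 1, 2)  # TRUE only for codes that are 0, 1, or 2
codes %in% c(3, 7)     # works for codes that are not next to each other
NA %in% c(8, 9)        # FALSE rather than NA: %in% never returns a missing
```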
Let’s first create a set of vectors that contain the values for each category.
democratcodes <- c(0,1,2)
republicancodes <- c(4,5,6)
independentcodes <- c(3,7)
dknacodes <- c(8,9)
Then, using our `%in%` operator, we can recode `PARTYID` a little easier now.
gss_bootcamp <- gss_bootcamp %>%
mutate(party3 = case_when( # NOTE that I created a new variable instead of overwriting the old one
PARTYID %in% democratcodes ~ "Democrat",
PARTYID %in% republicancodes ~ "Republican",
PARTYID %in% independentcodes ~ "Independent",
PARTYID %in% dknacodes ~ NA_character_,
))
table(gss_bootcamp$PARTYID, gss_bootcamp$party3, useNA = "always") # We can use the table feature to make sure everybody is where they're supposed to be
##
## Democrat Independent Republican <NA>
## 0 379 0 0 0
## 1 352 0 0 0
## 2 307 0 0 0
## 3 0 414 0 0
## 4 0 0 259 0
## 5 0 0 272 0
## 6 0 0 255 0
## 7 0 77 0 0
## 9 0 0 0 33
## <NA> 0 0 0 0
Some of our variables, like `AGE` and `TVHOURS`, are numeric. Let’s take a quick look at them:
##
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
## 22 26 15 27 40 29 38 43 31 39 45 43 50 34 43 42 65 40 40 43 38 55 41 40 40 39
## 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
## 40 29 42 30 33 31 37 41 29 48 35 48 50 39 37 46 46 39 24 37 33 39 36 27 39 41
## 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 99
## 45 33 22 20 29 23 19 28 11 9 12 9 12 7 10 12 14 5 8 29 7
## [1] "integer"
##
## -1 0 1 2 3 4 5 6 7 8 9 10 12 14 15 16 17 18 20 24
## 789 145 349 376 240 169 100 63 13 38 6 23 14 2 4 3 1 1 4 4
## 98 99
## 3 1
## [1] "integer"
Yep, they sure are numeric. (Or, at least integer.) But if we look at the end of the table and at the codebook, we can see there are some weird things happening.
For `AGE`:
Code | Label |
---|---|
89 | 89 or older |
98 | Don’t Know |
99 | No Answer |
And for `TVHOURS`:
Code | Label |
---|---|
-1 | Not Applicable |
98 | Don’t Know |
99 | No Answer |
Let’s code all of these as missing and keep the values for everything else. Also, just as a quirk of R, because `typeof(gss_bootcamp$AGE)` = integer and `typeof(gss_bootcamp$TVHOURS)` = integer, we use `NA_integer_` here instead of just setting them to `NA`.
gss_bootcamp <- gss_bootcamp %>%
mutate(
newage = case_when( # You can't go wrong with just calling a new variable "newvariable"
AGE >= 89 ~ NA_integer_,
AGE <= 88 ~ AGE
),
newtvhours = case_when(
TVHOURS >=98 ~ NA_integer_,
TVHOURS == -1 ~ NA_integer_,
TRUE ~ TVHOURS
)
)
table(gss_bootcamp$newage, useNA = "ifany")
##
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
## 22 26 15 27 40 29 38 43 31 39 45 43 50 34 43 42
## 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
## 65 40 40 43 38 55 41 40 40 39 40 29 42 30 33 31
## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65
## 37 41 29 48 35 48 50 39 37 46 46 39 24 37 33 39
## 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81
## 36 27 39 41 45 33 22 20 29 23 19 28 11 9 12 9
## 82 83 84 85 86 87 88 <NA>
## 12 7 10 12 14 5 8 36
##
## -1 0 1 2 3 4 5 6 7 8 9 10 12 14 15 16 17 18
## 0 0 145 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 1 0 0 349 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 376 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 240 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 169 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 100 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 63 0 0 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0
## 10 0 0 0 0 0 0 0 0 0 0 0 23 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0 0 0 0 0 14 0 0 0 0 0
## 14 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0
## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0
## 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0
## 17 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## <NA> 789 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 20 24 98 99
## 0 0 0 0 0
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 0 0 0
## 12 0 0 0 0
## 14 0 0 0 0
## 15 0 0 0 0
## 16 0 0 0 0
## 17 0 0 0 0
## 18 0 0 0 0
## 20 4 0 0 0
## 24 0 4 0 0
## <NA> 0 0 3 1
This diagonal line means it’s working. 0 goes to 0, 9 goes to 9, etc. But the ones we coded to NA (-1, 98, 99) are now `<NA>`.
In the last line for `newtvhours`, I add a final statement `TRUE ~ TVHOURS`. This means “everything left over” that hasn’t already been coded. It is very handy when we’re working with multiple conditions as it takes everything that hasn’t been previously included. Just be careful, though, because you might not want everything that’s left.
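To make that caution concrete, here is a sketch on made-up data (`toy`, `careless`, and `careful` are hypothetical names). If a missing code is forgotten before the catch-all, `TRUE ~ TVHOURS` quietly keeps it as if it were a real value.

```r
library(dplyr)

toy <- tibble(TVHOURS = c(-1L, 0L, 3L, 98L, 99L, 24L))  # hypothetical values

toy <- toy %>%
  mutate(
    careless = case_when(
      TVHOURS == -1 ~ NA_integer_,
      TRUE ~ TVHOURS              # 98 and 99 fall through and are kept as hours
    ),
    careful = case_when(
      TVHOURS >= 98 ~ NA_integer_,
      TVHOURS == -1 ~ NA_integer_,
      TRUE ~ TVHOURS              # only genuine values are left by this point
    )
  )

toy
```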
Since the last OQT, we’ve done:
Recoding with ifelse()
Recoding with case_when()
Practicing Data Management with Categorical Data
`SEX` –> `sex_text`
`RACE` –> `race3`
`HAPPY` –> `happy_text`
`PARTYID` –> `party3`
Practicing Data Management With Numeric Data
`AGE` –> `newage`
`TVHOURS` –> `newtvhours`
Let’s take some time to code our data.
Using the online codebook, code the following variables:
INCOME
Recode 0, 98, and 99 to missing
Everything else keep the same
EDUC
Recode 98 and 99 to missing
Everything else keep the same
POLVIEWS
Code as liberal, moderate, conservative
Don’t know, No Answer, and Not Applicable code to missing
Pick three of the national spending variables (starts with “nat”)
The next slide has snippets of code for if/when you get stuck.
gss_bootcamp <- gss_bootcamp %>%
mutate(
#### Sample Recode for income Variable
newincome = case_when(
INCOME == 0 ~ NA_integer_,
INCOME >= 98 ~ NA_integer_,
TRUE ~ INCOME
),
#### Sample Recode for education Variable
education = case_when(
EDUC %in% c(-1, 98, 99) ~ NA_integer_,
TRUE ~ EDUC
),
#### Sample Recode for polviews Variable
newpolviews = case_when(
POLVIEWS %in% c(1, 2, 3) ~ "Liberal",
POLVIEWS %in% c(5, 6, 7) ~ "Conservative",
POLVIEWS %in% 4 ~ "Moderate",
POLVIEWS %in% c(0, 8, 9) ~ NA_character_
),
#### Sample Recode for spending Variable
newnatenvir = case_when(
NATENVIR %in% c(8, 9, 0) ~ NA_integer_,
TRUE ~ NATENVIR
))
For one last practice, complete the following steps:
GSS Recoding
Load the GSS data from the file
From “GSS,” create a dataset “newgss” that includes the following variables
Race
Sex
Age
Self-ranked social position (“`RANK`”)
Occupational Prestige
Spouse’s Occupational Prestige (“`SPPRES10`”)
Frequency of Prayer (“`PRAY`”)
Recode data and missing data appropriately for the analyses in Parts 3 and 4
Remove anybody with missing values
Show how many rows and columns are in newgss
From newgss, create a new dataset of only those age 50 and up and show how many are in it
From newgss, create a new dataset of high-status people who pray often (however you choose to define it), then show how many are in it
load("~/R/ay23_lab_ta/gss_files/rawgss.RData")
# Part 1.1: Load GSS and name it newgss
newgss <- GSS %>%
# Part 1.2: Variable selection
select(RACE, SEX, AGE, RANK,
PRESTG10, SPPRES10, PRAY) %>%
# Part 1.3: Recoding Data (Including Missing Data)
mutate(
newage = case_when(
AGE >= 89 ~ NA_integer_,
AGE <= 88 ~ AGE
),
socpos = case_when(
RANK %in% c(0, 98, 99) ~ NA_integer_,
TRUE ~ RANK
),
prayer_freq = case_when(
PRAY == 1 ~ "Several Times/Day",
PRAY == 2 ~ "1ce/Day",
PRAY == 3 ~ "Several Times/Week",
PRAY == 4 ~ "1ce/Week",
PRAY == 5 ~ "< 1ce/Week",
PRAY == 6 ~ "Never",
PRAY %in% c(0, 8, 9) ~ NA_character_
)
) %>%
rename( # Not necessary but makes it easier to read
race = RACE,
sex = SEX,
prestige = PRESTG10,
spouse_prestige = SPPRES10
) %>%
# Part 1.4: Removing data with missing values
drop_na()
# Part 2: Count of newgss
dim(newgss)
## [1] 2216 10
# Part 3: Dataset of age >= 50
gss_old <- newgss %>%
filter(newage >= 50)
nrow(gss_old) # One way to show how many are there
## [1] 1044
# Part 4: High status + pray often
richprayers <- newgss %>%
# Social position is greater than 7/10 & prays at least once/day
filter(socpos > 7 & prayer_freq <3)
str(richprayers) # Another way to show how many are there
## 'data.frame': 61 obs. of 10 variables:
## $ race : int 1 1 1 3 1 2 1 2 2 2 ...
## $ sex : int 2 1 2 1 2 1 2 1 2 1 ...
## $ AGE : int 61 52 42 53 84 83 83 86 34 59 ...
## $ RANK : int 10 8 9 10 8 10 10 10 9 8 ...
## $ prestige : int 45 42 52 49 64 27 32 40 35 35 ...
## $ spouse_prestige: int 0 0 31 0 0 36 0 0 0 0 ...
## $ PRAY : int 2 4 2 2 5 2 2 2 2 2 ...
## $ newage : int 61 52 42 53 84 83 83 86 34 59 ...
## $ socpos : int 10 8 9 10 8 10 10 10 9 8 ...
## $ prayer_freq : chr "1ce/Day" "1ce/Week" "1ce/Day" "1ce/Day" ...
## - attr(*, "col.label")= chr [1:89] "Rs religious preference" "Favor preference in hiring blacks" "Blacks overcome prejudice without favors " "How close feel to blacks " ...
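One thing to watch in Part 4: `prayer_freq` is a character column, so `prayer_freq < 3` is evaluated as a text comparison rather than a comparison of prayer codes, which is why rows with `PRAY` values of 4 and 5 appear in the output above. A minimal sketch on made-up data (`toy` and `frequent` are hypothetical) that filters on the numeric codes instead, so that "at least once a day" means `PRAY` of 1 or 2:

```r
library(dplyr)

# Hypothetical rows; PRAY codes follow the recode above (1 = several times/day, 2 = once/day)
toy <- tibble(
  socpos      = c(8L, 9L, 10L, 5L, 8L),
  PRAY        = c(2L, 4L, 1L, 1L, 5L),
  prayer_freq = c("1ce/Day", "1ce/Week", "Several Times/Day",
                  "Several Times/Day", "< 1ce/Week")
)

frequent <- toy %>%
  filter(socpos > 7 & PRAY < 3)  # compare the numeric code, not the text label

nrow(frequent)
```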
So far today, we’ve learned
Missing Data
Workspaces and Projects
Importing Data
Coding and Recoding Data
Categorical
Numeric
Missing