Fred Traylor, Lab TA (he/him)
September 2, 2022
Factors: Another Type of Data
Summarizing Data
Rounding Numbers
Sorting Data
One Big Practice Session
Wrap Up
(This code is recreating the dataset from earlier today. Feel free to copy and run.)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'tidyr' was built under R version 4.1.2
## Warning: package 'readr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
load("~/R/ay23_lab_ta/bootcamp/rawgss.RData")
democratcodes <- c(0,1,2)
republicancodes <- c(4,5,6)
independentcodes <- c(3,7)
dknacodes <- c(8,9)
gss_bootcamp <- GSS %>%
as_tibble() %>%
select(ID_, AGE, EDUC, SEX, RACE, # Demographics
PARTYID, POLVIEWS, # Political Variables
PRESTG10, INCOME, TVHOURS, # Economic Variables
HAPPY, HEALTH, # Quality of Life
starts_with("NAT") # National Spending
) %>%
mutate(
numbers = 123456,
newnumbers = numbers,
numbers = numbers - 100000,
bignumbers = numbers * 2,
sex_text = ifelse(SEX == 1, "Male", "Female"),
sex_text_cw = case_when(
SEX == 1 ~ "Male",
SEX == 2 ~ "Female",
),
race3_ifelse = ifelse(RACE == 1, "White",
ifelse(RACE==2, "Black",
"Other")),
race3 = case_when(
RACE == 1 ~ "White",
RACE == 2 ~ "Black",
RACE == 3 ~ "Other"),
happy_text = case_when(
HAPPY == 1 ~ "Very Happy",
HAPPY == 2 ~ "Pretty Happy",
HAPPY == 3 ~ "Not too Happy",
HAPPY > 3 ~ NA_character_),
party3 = case_when(
PARTYID %in% democratcodes ~ "Democrat",
PARTYID %in% republicancodes ~ "Republican",
PARTYID %in% independentcodes ~ "Independent",
PARTYID %in% dknacodes ~ NA_character_) ,
newage = case_when(
AGE >= 89 ~ NA_integer_,
AGE <= 88 ~ AGE
),
newtvhours = case_when(
TVHOURS >=98 ~ NA_integer_,
TVHOURS == -1 ~ NA_integer_,
TRUE ~ TVHOURS
),
newincome = case_when(
INCOME == 0 ~ NA_integer_,
INCOME >= 98 ~ NA_integer_,
TRUE ~ INCOME
),
education = case_when(
EDUC %in% c(-1, 98, 99) ~ NA_integer_,
TRUE ~ EDUC
),
newpolviews = case_when(
POLVIEWS %in% c(1, 2, 3) ~ "Liberal",
POLVIEWS %in% c(5, 6, 7) ~ "Conservative",
POLVIEWS %in% 4 ~ "Moderate",
POLVIEWS %in% c(0, 8, 9) ~ NA_character_
)
)
Previously, our data types have been:
Character
Numeric: double and integer
Logical
We have one more type of data. Well, it’s actually more a subtype of character.
This one is called “Factor.”
It acts like character data, but has one special ability: it keeps things in order.
To turn character data into a factor, we use the function factor().
happy <- gss_bootcamp %>%
select(happy_text) %>%
mutate(happy_factor = factor(happy_text))
str(happy)
## tibble [2,348 x 2] (S3: tbl_df/tbl/data.frame)
## $ happy_text : chr [1:2348] "Pretty Happy" "Very Happy" "Very Happy" "Very Happy" ...
## $ happy_factor: Factor w/ 3 levels "Not too Happy",..: 2 3 3 3 2 1 2 2 1 2 ...
Notice that it says happy_text is “chr” but happy_factor is “Factor w/ 3 levels”.
We can use the function levels() to see what those three levels are.
## [1] "Not too Happy" "Pretty Happy" "Very Happy"
Note, though, that we cannot use levels() on character type data.
## NULL
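The calls behind the two outputs above are not shown. A minimal sketch of the same comparison, using a made-up vector (happy_chr and happy_fct are hypothetical names, not part of the lab dataset):

```r
# Toy values standing in for happy_text (made up for illustration)
happy_chr <- c("Pretty Happy", "Very Happy", "Not too Happy", "Very Happy")
happy_fct <- factor(happy_chr)

levels(happy_fct)  # the distinct values, in alphabetical order
levels(happy_chr)  # NULL, because character vectors have no levels
```

On the lab data, the equivalent calls would presumably be levels(happy$happy_factor) and levels(happy$happy_text).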
Where did these levels come from, though?
The function includes an argument to specify them: factor(VAR, levels = c(...)). Since we didn’t include anything for levels, it took the values that already occur in the data and put them in alphabetical order.
So we can actually specify the levels to put them in the order we want.
happy <- gss_bootcamp %>%
mutate(happy_factor = factor(happy_text,
levels = c("Very Happy", "Pretty Happy", "Not too Happy")))
levels(happy$happy_factor)
## [1] "Very Happy" "Pretty Happy" "Not too Happy"
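One thing to watch when typing levels by hand: any value that does not exactly match one of the listed levels (including capitalization) is quietly turned into NA. A small sketch with made-up values (the vector and object names are hypothetical):

```r
# Toy character values (made up); note the lowercase "too"
happy_chr <- c("Very Happy", "Not too Happy", "Pretty Happy")

# "Not Too Happy" (capital T) does not match "Not too Happy" in the data
bad <- factor(happy_chr,
              levels = c("Very Happy", "Pretty Happy", "Not Too Happy"))

# Levels spelled exactly as they appear in the data
good <- factor(happy_chr,
               levels = c("Very Happy", "Pretty Happy", "Not too Happy"))

is.na(bad)                    # the unmatched value became NA
table(good, useNA = "ifany")  # no missings when the spelling matches
```

Checking with table(..., useNA = "ifany") right after recoding is a quick way to catch this.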
Lastly, we can assign labels to our factors. This is another way to code them from the original integers. So instead of going integer to character to factor, we can go directly from integer to factor.
The full version of the function is: factor(VAR, levels = c(...), labels = c(...)). As with many things, it helps to break the call across several lines so it is clear what is happening.
happy <- gss_bootcamp %>%
mutate(happy_factor = factor(HAPPY,
levels = 1:3,
labels = c("Very Happy", "Pretty Happy", "Not Too Happy"))) %>%
select(HAPPY, happy_factor)
levels(happy$happy_factor)
## [1] "Very Happy" "Pretty Happy" "Not Too Happy"
This is also helpful when you’re recoding the same value labels for multiple variables. This:
labelset <- c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
dataset <- rawdata %>%
mutate(
var1 = factor(rawvar1,
levels = 1:4,
labels = labelset),
var2 = factor(rawvar2,
levels = 1:4,
labels = labelset))
is easier than:
dataset <- rawdata %>%
mutate(
var1 = factor(rawvar1,
levels = 1:4,
labels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")),
var2 = factor(rawvar2,
levels = 1:4,
labels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")))
Since we started this afternoon, we’ve done:
Take three national spending variables from our dataset (the ones starting with “NAT”) and recode them into factors. Remember to use the online codebook to get the coding right.
Double check missing data
Examples on the next slide.
gss_bootcamp <- gss_bootcamp %>%
mutate(
spend_enviro = factor(NATENVIR,
levels = 1:3,
labels = c("Too Little", "About Right", "Too Much")),
spend_race = factor(NATRACE,
levels = 1:3,
labels = c("Too Little", "About Right", "Too Much")),
spend_drug = factor(NATDRUG,
levels = 1:3,
labels = c("Too Little", "About Right", "Too Much"))
)
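table(gss_bootcamp$NATENVIR, gss_bootcamp$spend_enviro, useNA = "ifany")
table(gss_bootcamp$NATRACE, gss_bootcamp$spend_race, useNA = "ifany")
table(gss_bootcamp$NATDRUG, gss_bootcamp$spend_drug, useNA = "ifany")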
##
## Too Little About Right Too Much <NA>
## 0 0 0 0 1160
## 1 790 0 0 0
## 2 0 284 0 0
## 3 0 0 83 0
## 8 0 0 0 30
## 9 0 0 0 1
##
## Too Little About Right Too Much <NA>
## 0 0 0 0 1160
## 1 605 0 0 0
## 2 0 386 0 0
## 3 0 0 87 0
## 8 0 0 0 100
## 9 0 0 0 10
##
## Too Little About Right Too Much <NA>
## 0 0 0 0 1160
## 1 808 0 0 0
## 2 0 245 0 0
## 3 0 0 97 0
## 8 0 0 0 36
## 9 0 0 0 2
You could also save the labels in a vector and use that since they all
use the same coding scheme.
spendlabs <- c("Too Little", "About Right", "Too Much")
gss_bootcamp <- gss_bootcamp %>%
mutate(
spend_enviro = factor(NATENVIR,
levels = 1:3,
labels = spendlabs),
spend_race = factor(NATRACE,
levels = 1:3,
labels = spendlabs),
spend_drug = factor(NATDRUG,
levels = 1:3,
labels = spendlabs)
)
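table(gss_bootcamp$NATENVIR, gss_bootcamp$spend_enviro, useNA = "ifany")
table(gss_bootcamp$NATRACE, gss_bootcamp$spend_race, useNA = "ifany")
table(gss_bootcamp$NATDRUG, gss_bootcamp$spend_drug, useNA = "ifany")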
##
## Too Little About Right Too Much <NA>
## 0 0 0 0 1160
## 1 790 0 0 0
## 2 0 284 0 0
## 3 0 0 83 0
## 8 0 0 0 30
## 9 0 0 0 1
##
## Too Little About Right Too Much <NA>
## 0 0 0 0 1160
## 1 605 0 0 0
## 2 0 386 0 0
## 3 0 0 87 0
## 8 0 0 0 100
## 9 0 0 0 10
##
## Too Little About Right Too Much <NA>
## 0 0 0 0 1160
## 1 808 0 0 0
## 2 0 245 0 0
## 3 0 0 97 0
## 8 0 0 0 36
## 9 0 0 0 2
count() Function
When working with categorical data, sometimes we want the number of cases split by several independent variables.
For example, we can use the table() function to get a quick table of how many cases fall into each combination of party ID and happiness.
##
## Not too Happy Pretty Happy Very Happy
## Democrat 174 581 281
## Independent 77 258 154
## Republican 82 450 254
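The call that produces a table like the one above is not shown. A minimal sketch of a two-variable table(), using a made-up data frame (toy and its values are hypothetical, not the GSS data):

```r
# Toy data with the same column names as the lab dataset (values made up)
toy <- data.frame(
  party3     = c("Democrat", "Democrat", "Republican", "Independent", "Republican"),
  happy_text = c("Very Happy", "Pretty Happy", "Very Happy", "Very Happy", "Very Happy")
)

# Rows come from the first variable, columns from the second
tab <- table(toy$party3, toy$happy_text)
tab
```

On the lab data this would presumably be table(gss_bootcamp$party3, gss_bootcamp$happy_text).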
We can also use the count() function to give us a similar output.
## # A tibble: 14 x 3
## party3 happy_text n
## <chr> <chr> <int>
## 1 Democrat Not too Happy 174
## 2 Democrat Pretty Happy 581
## 3 Democrat Very Happy 281
## 4 Democrat <NA> 2
## 5 Independent Not too Happy 77
## 6 Independent Pretty Happy 258
## 7 Independent Very Happy 154
## 8 Independent <NA> 2
## 9 Republican Not too Happy 82
## 10 Republican Pretty Happy 450
## 11 Republican Very Happy 254
## 12 <NA> Not too Happy 3
## 13 <NA> Pretty Happy 18
## 14 <NA> Very Happy 12
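The count() call itself is not shown above. A sketch of what it likely looks like, run on a made-up tibble (toy is hypothetical; on the lab data you would start from gss_bootcamp instead):

```r
library(dplyr)

# Toy data (made up), including one missing party
toy <- tibble(
  party3     = c("Democrat", "Democrat", "Republican", NA),
  happy_text = c("Very Happy", "Very Happy", "Pretty Happy", "Very Happy")
)

# One row per combination that occurs, with the number of cases in column n
counts <- toy %>%
  count(party3, happy_text)
counts
```

Unlike table(), count() returns a tibble in "long" form and keeps NA as its own group.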
While not super helpful when there are only two variables, its real power comes when we have three or more. Oh, and it works great with the pipe (%>%). (table() is horrible with the pipe.)
## # A tibble: 27 x 4
## party3 happy_text sex_text n
## <chr> <chr> <chr> <int>
## 1 Democrat Not too Happy Female 95
## 2 Democrat Not too Happy Male 79
## 3 Democrat Pretty Happy Female 351
## 4 Democrat Pretty Happy Male 230
## 5 Democrat Very Happy Female 161
## 6 Democrat Very Happy Male 120
## 7 Democrat <NA> Female 1
## 8 Democrat <NA> Male 1
## 9 Independent Not too Happy Female 40
## 10 Independent Not too Happy Male 37
## # ... with 17 more rows
Notice that it includes every combination of party3, happy_text, and sex_text, including NA. We can use our new function drop_na() to remove missings on these variables before counting.
## # A tibble: 18 x 4
## party3 happy_text sex_text n
## <chr> <chr> <chr> <int>
## 1 Democrat Not too Happy Female 95
## 2 Democrat Not too Happy Male 79
## 3 Democrat Pretty Happy Female 351
## 4 Democrat Pretty Happy Male 230
## 5 Democrat Very Happy Female 161
## 6 Democrat Very Happy Male 120
## 7 Independent Not too Happy Female 40
## 8 Independent Not too Happy Male 37
## 9 Independent Pretty Happy Female 144
## 10 Independent Pretty Happy Male 114
## 11 Independent Very Happy Female 87
## 12 Independent Very Happy Male 67
## 13 Republican Not too Happy Female 42
## 14 Republican Not too Happy Male 40
## 15 Republican Pretty Happy Female 220
## 16 Republican Pretty Happy Male 230
## 17 Republican Very Happy Female 132
## 18 Republican Very Happy Male 122
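The code for the table above is not shown. A sketch of dropping missings before counting, on a made-up tibble (toy is hypothetical; the column names mirror the lab dataset):

```r
library(dplyr)
library(tidyr)  # drop_na() lives in tidyr

# Toy data (made up) with a missing value in two different columns
toy <- tibble(
  party3     = c("Democrat", "Democrat", NA, "Republican"),
  happy_text = c("Very Happy", NA, "Very Happy", "Pretty Happy"),
  sex_text   = c("Female", "Male", "Male", "Female")
)

counts <- toy %>%
  drop_na(party3, happy_text, sex_text) %>%  # drop rows missing on any of these
  count(party3, happy_text, sex_text)
counts
```

Naming the columns inside drop_na() only removes rows that are missing on those variables; an empty drop_na() removes rows missing on any column.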
group_by() and summarize() Functions
The table() function is handy for categorical data, but it often makes it hard to look at trends in numeric data.
dplyr includes two useful functions that help us with “summarizing” data “by groups.”
To use group_by(), specify which variable(s) you want your data grouped by. summarize() works similarly to mutate() in allowing us to calculate the summaries we want to display.
## # A tibble: 4 x 2
## party3 `mean(newage, na.rm = T)`
## <chr> <dbl>
## 1 Democrat 48.79824
## 2 Independent 43.75519
## 3 Republican 50.79124
## 4 <NA> 52.54545
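The code behind the output above is not shown. A sketch of the group_by() and summarize() pattern on a made-up tibble (toy is hypothetical), which also shows what happens without na.rm:

```r
library(dplyr)

# Toy data (made up); one Democrat has a missing age
toy <- tibble(
  party3 = c("Democrat", "Democrat", "Republican", "Republican"),
  newage = c(30L, NA, 40L, 60L)
)

# Without na.rm, any group containing an NA gets NA as its mean
with_na <- toy %>%
  group_by(party3) %>%
  summarize(mean_age = mean(newage))

# With na.rm = TRUE, missing ages are ignored inside each group
no_na <- toy %>%
  group_by(party3) %>%
  summarize(mean_age = mean(newage, na.rm = TRUE))

with_na
no_na
```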
When summarizing, make sure to account for missings. Otherwise, you get NAs for any variable with missing data. Two options:
Include na.rm=T inside the function to remove NA’s from data. (See above.)
Run drop_na() on newage. (See below.)
## # A tibble: 3 x 2
## party3 `mean(newage)`
## <chr> <dbl>
## 1 Democrat 48.79824
## 2 Independent 43.75519
## 3 Republican 50.79124
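The drop_na() version of the code is not shown. A sketch on a made-up tibble (toy is hypothetical); since the output above has no NA row for party3, this sketch assumes party3 was passed to drop_na() as well:

```r
library(dplyr)
library(tidyr)

# Toy data (made up) with a missing age and a missing party
toy <- tibble(
  party3 = c("Democrat", "Democrat", "Republican", NA),
  newage = c(30L, NA, 40L, 50L)
)

means <- toy %>%
  drop_na(party3, newage) %>%  # remove missings first, so mean() needs no na.rm
  group_by(party3) %>%
  summarize(mean_age = mean(newage))
means
```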
We can also group by multiple variables:
gss_bootcamp %>%
drop_na(party3, happy_text, newage) %>%
group_by(party3, happy_text) %>%
summarise(mean(newage))
## `summarise()` has grouped output by 'party3'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 3
## # Groups: party3 [3]
## party3 happy_text `mean(newage)`
## <chr> <chr> <dbl>
## 1 Democrat Not too Happy 48.96512
## 2 Democrat Pretty Happy 47.79649
## 3 Democrat Very Happy 50.61733
## 4 Independent Not too Happy 42.92208
## 5 Independent Pretty Happy 42.824
## 6 Independent Very Happy 45.71242
## 7 Republican Not too Happy 49.90123
## 8 Republican Pretty Happy 50.50790
## 9 Republican Very Happy 51.57540
And add multiple summaries.
gss_bootcamp %>%
drop_na(party3, happy_text, newage) %>%
group_by(party3, happy_text) %>%
summarise(mean(newage),
n(),
mean(PRESTG10))
## `summarise()` has grouped output by 'party3'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 5
## # Groups: party3 [3]
## party3 happy_text `mean(newage)` `n()` `mean(PRESTG10)`
## <chr> <chr> <dbl> <int> <dbl>
## 1 Democrat Not too Happy 48.96512 172 41.44186
## 2 Democrat Pretty Happy 47.79649 570 43.37544
## 3 Democrat Very Happy 50.61733 277 46.37545
## 4 Independent Not too Happy 42.92208 77 32.07792
## 5 Independent Pretty Happy 42.824 250 39.376
## 6 Independent Very Happy 45.71242 153 40.96732
## 7 Republican Not too Happy 49.90123 81 38.50617
## 8 Republican Pretty Happy 50.50790 443 43.70655
## 9 Republican Very Happy 51.57540 252 46.86508
We can (and should) also give our summaries descriptive names.
gss_bootcamp %>%
drop_na(party3, happy_text, newage) %>%
group_by(party3, happy_text) %>%
summarise(mean_age = mean(newage),
count = n(),
mean_occ_prest = mean(PRESTG10))
## `summarise()` has grouped output by 'party3'. You can override using the
## `.groups` argument.
## # A tibble: 9 x 5
## # Groups: party3 [3]
## party3 happy_text mean_age count mean_occ_prest
## <chr> <chr> <dbl> <int> <dbl>
## 1 Democrat Not too Happy 48.96512 172 41.44186
## 2 Democrat Pretty Happy 47.79649 570 43.37544
## 3 Democrat Very Happy 50.61733 277 46.37545
## 4 Independent Not too Happy 42.92208 77 32.07792
## 5 Independent Pretty Happy 42.824 250 39.376
## 6 Independent Very Happy 45.71242 153 40.96732
## 7 Republican Not too Happy 49.90123 81 38.50617
## 8 Republican Pretty Happy 50.50790 443 43.70655
## 9 Republican Very Happy 51.57540 252 46.86508
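The message printed with each result above mentions a `.groups` argument. A short sketch of using it, on a made-up tibble (toy is hypothetical): setting .groups = "drop" returns an ungrouped result and, as far as we can tell, also silences the message.

```r
library(dplyr)

# Toy data (made up)
toy <- tibble(
  party3     = c("Democrat", "Democrat", "Republican"),
  happy_text = c("Very Happy", "Very Happy", "Pretty Happy"),
  newage     = c(30, 50, 40)
)

res <- toy %>%
  group_by(party3, happy_text) %>%
  summarise(mean_age = mean(newage),
            count    = n(),
            .groups  = "drop")  # return a plain, ungrouped tibble
res
```

An ungrouped result is usually what we want if the next step is arrange() or another summary over the whole table.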
Since the last OQT, we’ve done:
count() Function
group_by() Function
summarize() Function
round()
Oftentimes, we have numbers with a lot of decimals.
## [1] 2.3333
We can use round() to round those numbers to the desired number of digits after the decimal:
## [1] 2.33
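The round() call is not shown above. A minimal sketch (x and the other object names are hypothetical):

```r
x <- 7 / 3

rounded <- round(x, 2)  # second argument: digits to keep after the decimal
rounded

whole <- round(x)       # with no second argument, rounds to a whole number
whole
```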
We can include it in a mutate() function as well:
gss_bootcamp %>%
mutate(decimal = 7/3,
newdecimal = round(decimal,1)) %>%
select(decimal, newdecimal) %>%
head(10)
## # A tibble: 10 x 2
## decimal newdecimal
## <dbl> <dbl>
## 1 2.333333 2.3
## 2 2.333333 2.3
## 3 2.333333 2.3
## 4 2.333333 2.3
## 5 2.333333 2.3
## 6 2.333333 2.3
## 7 2.333333 2.3
## 8 2.333333 2.3
## 9 2.333333 2.3
## 10 2.333333 2.3
And include it within a summarize() function:
gss_bootcamp %>%
drop_na() %>%
group_by(race3) %>%
summarize(meantv = mean(newtvhours),
roundmeantv = round(mean(newtvhours),3))
## # A tibble: 3 x 3
## race3 meantv roundmeantv
## <chr> <dbl> <dbl>
## 1 Black 3.762376 3.762
## 2 Other 2.492537 2.493
## 3 White 2.648188 2.648
arrange()
If we want to arrange a dataframe (or tibble) by a certain value, we can use arrange().
## # A tibble: 637 x 1
## newage
## <int>
## 1 18
## 2 18
## 3 18
## 4 18
## 5 18
## 6 18
## 7 18
## 8 18
## 9 18
## 10 18
## # ... with 627 more rows
## # A tibble: 637 x 1
## newpolviews
## <chr>
## 1 Conservative
## 2 Conservative
## 3 Conservative
## 4 Conservative
## 5 Conservative
## 6 Conservative
## 7 Conservative
## 8 Conservative
## 9 Conservative
## 10 Conservative
## # ... with 627 more rows
Here, it sorted alphabetically by newpolviews. In this case, since newpolviews has values “Conservative, Moderate, Liberal”, it gave us “Conservative,” which is the first value alphabetically.
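The arrange() calls for the two outputs above are not shown. A sketch of both kinds of sort on a made-up tibble (toy is hypothetical; column names mirror the lab dataset):

```r
library(dplyr)

# Toy data (made up)
toy <- tibble(
  newage      = c(45L, 18L, 30L),
  newpolviews = c("Moderate", "Liberal", "Conservative")
)

by_age   <- toy %>% arrange(newage)       # numbers: smallest first
by_views <- toy %>% arrange(newpolviews)  # text: alphabetical

by_age
by_views
```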
desc()
arrange() sorts in ascending order by default: smallest to largest for numbers and A to Z for text. (Factors are sorted in the order of their levels.)
If we want to reverse this and sort in descending order, we can use desc(). We typically use it inside arrange(), like so: arrange(desc(VAR)).
## # A tibble: 10 x 1
## newage
## <int>
## 1 88
## 2 87
## 3 87
## 4 86
## 5 86
## 6 86
## 7 86
## 8 86
## 9 85
## 10 85
We can also use it with text.
## # A tibble: 10 x 1
## newpolviews
## <chr>
## 1 Moderate
## 2 Moderate
## 3 Moderate
## 4 Moderate
## 5 Moderate
## 6 Moderate
## 7 Moderate
## 8 Moderate
## 9 Moderate
## 10 Moderate
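The code for the two descending sorts above is not shown. A sketch on a made-up tibble (toy is hypothetical), with head() to keep only the first rows:

```r
library(dplyr)

# Toy data (made up)
toy <- tibble(
  newage      = c(45L, 88L, 18L, 30L),
  newpolviews = c("Liberal", "Conservative", "Moderate", "Liberal")
)

oldest <- toy %>%
  arrange(desc(newage)) %>%  # largest age first
  head(2)                    # keep the top two rows

reverse_alpha <- toy %>%
  arrange(desc(newpolviews)) # Z to A, so "Moderate" comes before "Liberal"

oldest
reverse_alpha
```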
Also, since summarize() gives us results in alphabetical or numerical order of the grouping variable, it can be helpful to arrange(desc()) the results, like this:
gss_bootcamp %>%
drop_na() %>%
group_by(race3) %>%
summarize(meanage = mean(newage)) %>%
arrange(desc(race3)) # Arrange descending order: race3
## # A tibble: 3 x 2
## race3 meanage
## <chr> <dbl>
## 1 White 49.24520
## 2 Other 39.91045
## 3 Black 45.12871
gss_bootcamp %>%
drop_na() %>%
group_by(race3) %>%
summarize(meanage = mean(newage)) %>%
arrange(desc(meanage)) # Arrange descending order: meanage
## # A tibble: 3 x 2
## race3 meanage
## <chr> <dbl>
## 1 White 49.24520
## 2 Black 45.12871
## 3 Other 39.91045
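Sorting also ties back to factors: arranging by a factor follows the order of its levels rather than the alphabet. A sketch on a made-up tibble (toy and lvls are hypothetical names):

```r
library(dplyr)

# Toy data (made up)
toy <- tibble(happy_text = c("Pretty Happy", "Not too Happy", "Very Happy"))

# Put the levels in a meaningful order instead of alphabetical order
lvls <- c("Very Happy", "Pretty Happy", "Not too Happy")
toy <- toy %>%
  mutate(happy_factor = factor(happy_text, levels = lvls))

alpha_order <- toy %>% arrange(happy_text)    # character: A to Z
level_order <- toy %>% arrange(happy_factor)  # factor: order of the levels

alpha_order
level_order
```

This is one practical reason to store ordered categories as factors before building summary tables.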
Since the last OQT, we’ve done:
round()
arrange()
desc()
This afternoon, we’ve done:
Factors
Summarizing Data
count()
group_by()
summarize()
round(), arrange(), and desc()
Create a new GSS dataset “session4”
Select the following variables: HEALTH, RACE, COHORT, EDUC, SEX, AFFRMACT, and RANK
Perform the appropriate data management
Drop any rows with missing data
Using “session4”, create the following summary tables
Average birth year by somebody’s health
Rounded to one decimal place
And sorted least to greatest
Average social position by views on affirmative action
Count of how many people are in each intersection group of race and sex
# Part 1
session4 <- GSS %>%
# Part 1a: Select the following variables: HEALTH, RACE, COHORT, EDUC, SEX, AFFRMACT, and RANK
select(HEALTH, RACE, COHORT, EDUC, SEX, AFFRMACT, RANK) %>%
# Part 1b: Perform the appropriate data management
mutate(
health = factor(HEALTH,
levels = 1:4,
labels = c("Excellent", "Good", "Fair", "Poor")),
race = case_when(
RACE == 1 ~ "White",
RACE == 2 ~ "Black",
RACE == 3 ~ "Other"),
cohort = case_when(
COHORT %in% c(-1, 9999) ~ NA_integer_,
TRUE ~ COHORT
),
education = case_when(
EDUC %in% c(-1, 98, 99) ~ NA_integer_,
TRUE ~ EDUC
),
sex = ifelse(SEX == 1, "Male", "Female"),
affirmative = factor(AFFRMACT,
levels = 1:4,
labels = c("Strongly Favors", "Not Strongly Favors",
"Not Strongly Opposes", "Strongly Opposes")),
socpos = case_when(
RANK %in% c(0, 98, 99) ~ NA_integer_,
TRUE ~ RANK
)
) %>%
# Part 1c: Drop any rows with missing data
drop_na()
# Part 2a: Average birth year by somebody's health
session4 %>%
group_by(health) %>%
# Round to 1 decimal place
summarize(meanyear = round(mean(cohort),1)) %>%
# Sort by year
arrange(meanyear)
## # A tibble: 4 x 2
## health meanyear
## <fct> <dbl>
## 1 Poor 1965.1
## 2 Good 1968.9
## 3 Fair 1968.9
## 4 Excellent 1969.3
# Part 2b: Average social position by views on affirmative action
session4 %>%
group_by(affirmative) %>%
# Rounded to 2 decimal places
summarize(social_position = round(mean(socpos),2))
## # A tibble: 4 x 2
## affirmative social_position
## <fct> <dbl>
## 1 Strongly Favors 4.73
## 2 Not Strongly Favors 4.73
## 3 Not Strongly Opposes 4.71
## 4 Strongly Opposes 4.62
# Part 2c: Counts of race X sex
session4 %>%
group_by(sex, race) %>%
count() %>%
# Sorted from most to least
arrange(desc(n))
## # A tibble: 6 x 3
## # Groups: sex, race [6]
## sex race n
## <chr> <chr> <int>
## 1 Female White 288
## 2 Male White 229
## 3 Female Black 70
## 4 Male Other 46
## 5 Male Black 44
## 6 Female Other 34
Yesterday and today, we’ve done:
R Basics
Data Management
Filtering Rows and Selecting Columns
Coding and Recoding Data
Shortcuts & Operators
Name | Symbol | Shortcut |
---|---|---|
Assignment Arrow | <- | Alt/Option and - |
Comment | # | Ctrl/Cmd and Shift and c |
Pipe | %>% | Ctrl/Cmd and Shift and m |
Matching Operator / Percent-in-Percent | %in% | |
Three common errors
Using one equals sign (=) instead of two (==) for logical tests
Forgetting to separate arguments with a comma
Forgetting to close a function with )
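As a quick reminder of the first error and of the matching operator from the table above, here is a sketch on a made-up tibble (toy is hypothetical; the codes follow the PARTYID recode used earlier, where 0, 1, and 2 are the Democrat codes):

```r
library(dplyr)

# Toy data (made up)
toy <- tibble(
  SEX     = c(1L, 2L, 2L),
  PARTYID = c(0L, 5L, 3L)
)

# Two equals signs test whether values are equal
men <- toy %>% filter(SEX == 1)

# %in% tests whether each value matches any element of a vector
dems <- toy %>% filter(PARTYID %in% c(0, 1, 2))

# One equals sign is not a logical test; dplyr stops with an error here:
# toy %>% filter(SEX = 1)

men
dems
```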
Wow, what an action-packed two days it has been!
As you know, we’ll meet up for Soc 541 (aka Stats 1) on Monday, September 12. (No classes this coming Monday for Labor Day.)
Over the next week or so, you should go to the RStudio Primer page (https://rstudio.cloud/learn/primers) and work through the following primers:
The Basics: https://rstudio.cloud/learn/primers/1
Programming Basics
Work With Data: https://rstudio.cloud/learn/primers/2
Working with Tibbles
Isolating Data with dplyr
Deriving Information with dplyr
RStudio.Cloud is a website that hosts RStudio (free and paid). It also hosts a number of excellent primers that introduce (or, in this case, review) a number of common tasks.
Each primer has a short video (two minutes or less) followed by some questions and coding exercises.
Don’t worry if you see something we haven’t gotten to yet. The primers are fantastic because they have a solutions button: copy and paste, then see if you understand what’s going on. If it doesn’t click, feel free to skip it and move on.
The last primer (Deriving Information with dplyr) definitely goes beyond what we’ve covered today and what we’ll need for the class.