Loading our Dataset

(This code is recreating the dataset from earlier today. Feel free to copy and run.)

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## Warning: package 'tidyr' was built under R version 4.1.2

## Warning: package 'readr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

load("~/R/ay23_lab_ta/bootcamp/rawgss.RData")

democratcodes <- c(0,1,2)
republicancodes <- c(4,5,6)
independentcodes <- c(3,7)
dknacodes <- c(8,9)

gss_bootcamp <- GSS %>% 
  as_tibble() %>% 
  select(ID_, AGE, EDUC, SEX, RACE,      # Demographics
         PARTYID, POLVIEWS,              # Political Variables
         PRESTG10, INCOME, TVHOURS,      # Economic Variables
         HAPPY, HEALTH,                  # Quality of Life
         starts_with("NAT")              # National Spending 
         ) %>% 
  mutate(
    numbers = 123456,
    newnumbers = numbers,
    numbers = numbers - 100000,
    bignumbers = numbers * 2,
    
    sex_text = ifelse(SEX == 1, "Male", "Female"),
    sex_text_cw = case_when(
      SEX == 1 ~ "Male",   
      SEX == 2 ~ "Female", 
      ),
    race3_ifelse = ifelse(RACE == 1, "White",
                          ifelse(RACE==2, "Black",
                                 "Other")),
    race3 = case_when(
      RACE == 1 ~ "White",
      RACE == 2 ~ "Black",
      RACE == 3 ~ "Other"),
    
    happy_text = case_when(
      HAPPY == 1 ~ "Very Happy",
      HAPPY == 2 ~ "Pretty Happy",
      HAPPY == 3 ~ "Not too Happy",
      HAPPY > 3  ~ NA_character_),
    
    party3 = case_when(
      PARTYID %in% democratcodes ~ "Democrat",
      PARTYID %in% republicancodes ~ "Republican",
      PARTYID %in% independentcodes ~ "Independent",
      PARTYID %in% dknacodes ~ NA_character_) ,
    newage = case_when(
      AGE >= 89 ~ NA_integer_,
      AGE <= 88 ~ AGE
      ),
    newtvhours = case_when(
      TVHOURS >=98 ~ NA_integer_,
      TVHOURS == -1 ~ NA_integer_,
      TRUE ~ TVHOURS
      ),
    
    newincome = case_when(
      INCOME == 0 ~ NA_integer_,
      INCOME >= 98 ~ NA_integer_,
      TRUE ~ INCOME
      ),
    education = case_when(
      EDUC %in% c(-1, 98, 99) ~ NA_integer_,
      TRUE ~ EDUC
      ),
    
    newpolviews = case_when(
      POLVIEWS %in% c(1, 2, 3) ~ "Liberal",
      POLVIEWS %in% c(5, 6, 7) ~ "Conservative",
      POLVIEWS %in% 4 ~ "Moderate",
      POLVIEWS %in% c(0, 8, 9) ~ NA_character_
      )
    )

Factors: Our Last Data Type

Data Type Review

Previously, our data types have been:

Character
Numeric: double and integer
Logical

We have one more type of data. Well, it’s actually more of a special class than a new basic type.

This one is called “Factor.”

It acts like character data, but has one special ability: it keeps things in order.
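
A factor is stored as integer codes plus a set of text labels (its levels), which is what lets it keep an order. A minimal sketch with a made-up vector (`sizes` is hypothetical and not part of our dataset):

```r
# Hypothetical toy vector, not from the GSS data
sizes <- c("Medium", "Small", "Large", "Small")

# Turn it into a factor with an explicit order
sizes_factor <- factor(sizes, levels = c("Small", "Medium", "Large"))

typeof(sizes)             # "character"
typeof(sizes_factor)      # "integer": the codes behind the labels
as.integer(sizes_factor)  # 2 1 3 1

sort(sizes)               # alphabetical: "Large" "Medium" "Small" "Small"
sort(sizes_factor)        # level order: Small Small Medium Large
```

The same values sort differently depending on whether they are character or factor, which is the "special ability" in action.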

Creating Factor Data

To turn character data into factor, we use the function factor().

happy <- gss_bootcamp %>% 
  select(happy_text) %>% 
  mutate(happy_factor = factor(happy_text)) 
str(happy)

## tibble [2,348 x 2] (S3: tbl_df/tbl/data.frame)
##  $ happy_text  : chr [1:2348] "Pretty Happy" "Very Happy" "Very Happy" "Very Happy" ...
##  $ happy_factor: Factor w/ 3 levels "Not too Happy",..: 2 3 3 3 2 1 2 2 1 2 ...

Notice that it says happy_text is “chr” but happy_factor is “Factor w/3 levels”.

We can use the function levels() to see what those three levels are.

levels(happy$happy_factor)

## [1] "Not too Happy" "Pretty Happy"  "Very Happy"

Note, though, that we cannot use levels() on character type data.

levels(happy$happy_text)

## NULL

Factor Levels

Where did these levels come from, though?

The function includes an argument to specify them: factor(VAR, levels = c(...)). Since we didn’t include anything for levels, it took the already-occurring values and put them in order (alphabetically).

So we can actually specify the levels to put them in the order we want.

happy <- gss_bootcamp %>% 
  mutate(happy_factor = factor(happy_text,
                               levels = c("Very Happy", "Pretty Happy", "Not too Happy"))) 
levels(happy$happy_factor)

## [1] "Very Happy"    "Pretty Happy"  "Not too Happy"
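
Levels have to match the values in the data exactly, including capitalization; anything that doesn’t match becomes NA without a warning. A minimal sketch, assuming a made-up vector (`answers` is hypothetical):

```r
# Hypothetical toy vector, not from the GSS data
answers <- c("Not too Happy", "Very Happy", "Pretty Happy")

# "Not Too Happy" (capital T) does not match the data value "Not too Happy"
mismatch <- factor(answers,
                   levels = c("Very Happy", "Pretty Happy", "Not Too Happy"))
mismatch
sum(is.na(mismatch))   # 1: the unmatched value was turned into NA

# Spelling the level exactly as it appears in the data keeps every value
matched <- factor(answers,
                  levels = c("Very Happy", "Pretty Happy", "Not too Happy"))
sum(is.na(matched))    # 0
```

Note that levels() would look fine in both cases, so it is worth checking for unexpected NAs after setting levels by hand.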

Factor Labels

Lastly, we can assign labels to our factors. This is another way to code them from the original integers. So instead of going integer to character to factor, we can go directly from integer to factor.

The full version of the function is: factor(VAR, levels = c(...), labels = c(...)). As with many things, it helps to put each argument on its own line so it is clear what is happening.

happy <- gss_bootcamp %>% 
  mutate(happy_factor = factor(HAPPY,
                               levels = 1:3,
                               labels = c("Very Happy", "Pretty Happy", "Not Too Happy"))) %>% 
  select(HAPPY, happy_factor) 
levels(happy$happy_factor)

## [1] "Very Happy"    "Pretty Happy"  "Not Too Happy"
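
Any code that isn’t listed in levels is turned into NA. That is handy for missing-data codes, but worth double-checking. A minimal sketch with made-up codes (the vector `raw` is hypothetical; its 0, 8, and 9 simply stand in for missing-data codes):

```r
# Hypothetical raw codes: 1-3 are real answers, 0/8/9 stand in for missing codes
raw <- c(1, 2, 3, 0, 8, 9, 2, 1)

recoded <- factor(raw,
                  levels = 1:3,
                  labels = c("Very Happy", "Pretty Happy", "Not Too Happy"))

# Cross-tabulate raw codes against the new factor; useNA adds the NA column
table(raw, recoded, useNA = "ifany")

sum(is.na(recoded))   # 3: codes 0, 8, and 9 are not in levels, so they became NA
```

Each raw code should land in exactly one column of the table, with the missing-data codes all falling under NA.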

This is also helpful when you’re recoding the same value labels for multiple variables. This:

labelset <- c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")
dataset <- rawdata %>% 
  mutate(
    var1 = factor(rawvar1,
                  levels = 1:4,
                  labels = labelset),
    var2 = factor(rawvar2,
                  levels = 1:4,
                  labels = labelset))

is easier than:

dataset <- rawdata %>% 
  mutate(
    var1 = factor(rawvar1,
                  levels = 1:4,
                  labels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")),
    var2 = factor(rawvar2,
                  levels = 1:4,
                  labels = c("Strongly Agree", "Agree", "Disagree", "Strongly Disagree")))

Official Question Time 1

Since we started this afternoon, we’ve done:

Factors

Practice with Factors

Take three national spending variables from our dataset (the ones starting with “NAT”) and recode them into factors. Remember to use the online codebook to get the coding right.

- Double check missing data

Examples on the next slide.

Practice Examples

gss_bootcamp <- gss_bootcamp %>% 
  mutate(
    spend_enviro = factor(NATENVIR,
                          levels = 1:3,
                          labels = c("Too Little", "About Right", "Too Much")),
    spend_race = factor(NATRACE,
                        levels = 1:3,
                        labels = c("Too Little", "About Right", "Too Much")),
    spend_drug = factor(NATDRUG,
                        levels = 1:3,
                        labels = c("Too Little", "About Right", "Too Much"))
    )

table(gss_bootcamp$NATENVIR, gss_bootcamp$spend_enviro, useNA = "ifany")

##    
##     Too Little About Right Too Much <NA>
##   0          0           0        0 1160
##   1        790           0        0    0
##   2          0         284        0    0
##   3          0           0       83    0
##   8          0           0        0   30
##   9          0           0        0    1

table(gss_bootcamp$NATRACE, gss_bootcamp$spend_race, useNA = "ifany")

##    
##     Too Little About Right Too Much <NA>
##   0          0           0        0 1160
##   1        605           0        0    0
##   2          0         386        0    0
##   3          0           0       87    0
##   8          0           0        0  100
##   9          0           0        0   10

table(gss_bootcamp$NATDRUG, gss_bootcamp$spend_drug, useNA = "ifany")

##    
##     Too Little About Right Too Much <NA>
##   0          0           0        0 1160
##   1        808           0        0    0
##   2          0         245        0    0
##   3          0           0       97    0
##   8          0           0        0   36
##   9          0           0        0    2

You could also save the labels in a vector and use that since they all use the same coding scheme.

spendlabs <- c("Too Little", "About Right", "Too Much")

gss_bootcamp <- gss_bootcamp %>% 
  mutate(
    spend_enviro = factor(NATENVIR,
                          levels = 1:3,
                          labels = spendlabs),
    spend_race = factor(NATRACE,
                        levels = 1:3,
                        labels = spendlabs),
    spend_drug = factor(NATDRUG,
                        levels = 1:3,
                        labels = spendlabs)
    )

table(gss_bootcamp$NATENVIR, gss_bootcamp$spend_enviro, useNA = "ifany")

##    
##     Too Little About Right Too Much <NA>
##   0          0           0        0 1160
##   1        790           0        0    0
##   2          0         284        0    0
##   3          0           0       83    0
##   8          0           0        0   30
##   9          0           0        0    1

table(gss_bootcamp$NATRACE, gss_bootcamp$spend_race, useNA = "ifany")

##    
##     Too Little About Right Too Much <NA>
##   0          0           0        0 1160
##   1        605           0        0    0
##   2          0         386        0    0
##   3          0           0       87    0
##   8          0           0        0  100
##   9          0           0        0   10

table(gss_bootcamp$NATDRUG, gss_bootcamp$spend_drug, useNA = "ifany")

##    
##     Too Little About Right Too Much <NA>
##   0          0           0        0 1160
##   1        808           0        0    0
##   2          0         245        0    0
##   3          0           0       97    0
##   8          0           0        0   36
##   9          0           0        0    2

The `count()` Function

When working with categorical data, sometimes we want the number of cases split by several independent variables.

For example, we can use the table() function to get a quick table of how many respondents fall into each combination of Party ID and Happiness.

table(gss_bootcamp$party3, gss_bootcamp$happy_text)

##              
##               Not too Happy Pretty Happy Very Happy
##   Democrat              174          581        281
##   Independent            77          258        154
##   Republican             82          450        254

We can also use the count() function to give us a similar output.

count(gss_bootcamp, party3, happy_text)

## # A tibble: 14 x 3
##    party3      happy_text        n
##    <chr>       <chr>         <int>
##  1 Democrat    Not too Happy   174
##  2 Democrat    Pretty Happy    581
##  3 Democrat    Very Happy      281
##  4 Democrat    <NA>              2
##  5 Independent Not too Happy    77
##  6 Independent Pretty Happy    258
##  7 Independent Very Happy      154
##  8 Independent <NA>              2
##  9 Republican  Not too Happy    82
## 10 Republican  Pretty Happy    450
## 11 Republican  Very Happy      254
## 12 <NA>        Not too Happy     3
## 13 <NA>        Pretty Happy     18
## 14 <NA>        Very Happy       12

While not super helpful when there are only two variables, its real power comes when we have three or more. Oh, and it works great with the pipe (%>%). (table() is horrible with the pipe.)

gss_bootcamp %>% count(party3, happy_text, sex_text)

## # A tibble: 27 x 4
##    party3      happy_text    sex_text     n
##    <chr>       <chr>         <chr>    <int>
##  1 Democrat    Not too Happy Female      95
##  2 Democrat    Not too Happy Male        79
##  3 Democrat    Pretty Happy  Female     351
##  4 Democrat    Pretty Happy  Male       230
##  5 Democrat    Very Happy    Female     161
##  6 Democrat    Very Happy    Male       120
##  7 Democrat    <NA>          Female       1
##  8 Democrat    <NA>          Male         1
##  9 Independent Not too Happy Female      40
## 10 Independent Not too Happy Male        37
## # ... with 17 more rows

Notice that it includes every combination of party3, happy_text and sex_text, including NA. We can use our new function drop_na() to remove missings on these variables before counting.

gss_bootcamp %>%
  drop_na(party3, happy_text, sex_text) %>% 
  count(party3, happy_text, sex_text)

## # A tibble: 18 x 4
##    party3      happy_text    sex_text     n
##    <chr>       <chr>         <chr>    <int>
##  1 Democrat    Not too Happy Female      95
##  2 Democrat    Not too Happy Male        79
##  3 Democrat    Pretty Happy  Female     351
##  4 Democrat    Pretty Happy  Male       230
##  5 Democrat    Very Happy    Female     161
##  6 Democrat    Very Happy    Male       120
##  7 Independent Not too Happy Female      40
##  8 Independent Not too Happy Male        37
##  9 Independent Pretty Happy  Female     144
## 10 Independent Pretty Happy  Male       114
## 11 Independent Very Happy    Female      87
## 12 Independent Very Happy    Male        67
## 13 Republican  Not too Happy Female      42
## 14 Republican  Not too Happy Male        40
## 15 Republican  Pretty Happy  Female     220
## 16 Republican  Pretty Happy  Male       230
## 17 Republican  Very Happy    Female     132
## 18 Republican  Very Happy    Male       122
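
To see the difference between table() and count() on something small, here is a minimal sketch with made-up data (`toy` and its columns are hypothetical). It also uses count()’s `sort` argument, which we haven’t needed so far:

```r
library(dplyr)

# Hypothetical toy data, not the GSS
toy <- tibble(
  party = c("Democrat", "Democrat", "Republican", "Independent", "Democrat", NA),
  happy = c("Very Happy", "Very Happy", "Pretty Happy", "Very Happy",
            "Pretty Happy", "Very Happy")
)

# table() leaves out the row with a missing party by default
table(toy$party, toy$happy)

# count() keeps it as its own row
toy %>% count(party, happy)

# sort = TRUE puts the most common combinations first
toy_sorted <- toy %>% count(party, happy, sort = TRUE)
toy_sorted
```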

The `group_by()` and `summarize()` Functions

The table() function is handy for categorical data, but often makes it hard to look at trends with numeric data. dplyr includes two useful functions that help us with “summarizing” data “by groups.”

To use group_by(), specify which variable(s) you want your data grouped by. summarize() works similarly to mutate(), letting us calculate the summaries we want to display. (dplyr accepts both spellings, summarize() and summarise().)

gss_bootcamp %>% 
  group_by(party3) %>% 
  summarise(mean(newage, na.rm=T))

## # A tibble: 4 x 2
##   party3      `mean(newage, na.rm = T)`
##   <chr>                           <dbl>
## 1 Democrat                     48.79824
## 2 Independent                  43.75519
## 3 Republican                   50.79124
## 4 <NA>                         52.54545

When summarizing, make sure to account for missings. Otherwise, you get NAs for any variable with missing data. Two options:

Include na.rm=T inside the function to remove NA’s from data. (See above.)
Run drop_na() on newage. (See below.)

gss_bootcamp %>% 
  drop_na(party3, newage) %>% 
  group_by(party3) %>% 
  summarise(mean(newage))

## # A tibble: 3 x 2
##   party3      `mean(newage)`
##   <chr>                <dbl>
## 1 Democrat          48.79824
## 2 Independent       43.75519
## 3 Republican        50.79124
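
To see why missings matter, here is a minimal sketch with a made-up vector (`ages` is hypothetical). It spells out TRUE rather than T, which is a little safer because T is an ordinary variable that can be overwritten:

```r
# Hypothetical toy vector with one missing value
ages <- c(25, 40, NA, 55)

mean(ages)                 # NA: a single missing value makes the result NA
mean(ages, na.rm = TRUE)   # 40: the NA is removed before averaging
```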

We can also group by multiple variables:

gss_bootcamp %>% 
  drop_na(party3, happy_text, newage) %>% 
  group_by(party3, happy_text) %>% 
  summarise(mean(newage))

## `summarise()` has grouped output by 'party3'. You can override using the
## `.groups` argument.

## # A tibble: 9 x 3
## # Groups:   party3 [3]
##   party3      happy_text    `mean(newage)`
##   <chr>       <chr>                  <dbl>
## 1 Democrat    Not too Happy       48.96512
## 2 Democrat    Pretty Happy        47.79649
## 3 Democrat    Very Happy          50.61733
## 4 Independent Not too Happy       42.92208
## 5 Independent Pretty Happy        42.824  
## 6 Independent Very Happy          45.71242
## 7 Republican  Not too Happy       49.90123
## 8 Republican  Pretty Happy        50.50790
## 9 Republican  Very Happy          51.57540
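
The message above is only informational: after summarising by two variables, the result is still grouped by the first one. A minimal sketch, using made-up data (`toy` is hypothetical), of the `.groups` argument the message mentions:

```r
library(dplyr)

# Hypothetical toy data, not the GSS
toy <- tibble(
  party = c("Democrat", "Democrat", "Republican", "Republican"),
  happy = c("Very Happy", "Pretty Happy", "Very Happy", "Very Happy"),
  age   = c(30, 50, 40, 60)
)

# .groups = "drop" returns an ungrouped tibble and silences the message
toy_means <- toy %>%
  group_by(party, happy) %>%
  summarise(mean_age = mean(age), .groups = "drop")

toy_means
is_grouped_df(toy_means)   # FALSE
```

Leaving the message alone is also fine for what we are doing here; this just shows what it is pointing to.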

And add multiple summaries.

gss_bootcamp %>% 
  drop_na(party3, happy_text, newage) %>% 
  group_by(party3, happy_text) %>% 
  summarise(mean(newage),
            n(),
            mean(PRESTG10))

## `summarise()` has grouped output by 'party3'. You can override using the
## `.groups` argument.

## # A tibble: 9 x 5
## # Groups:   party3 [3]
##   party3      happy_text    `mean(newage)` `n()` `mean(PRESTG10)`
##   <chr>       <chr>                  <dbl> <int>            <dbl>
## 1 Democrat    Not too Happy       48.96512   172         41.44186
## 2 Democrat    Pretty Happy        47.79649   570         43.37544
## 3 Democrat    Very Happy          50.61733   277         46.37545
## 4 Independent Not too Happy       42.92208    77         32.07792
## 5 Independent Pretty Happy        42.824     250         39.376  
## 6 Independent Very Happy          45.71242   153         40.96732
## 7 Republican  Not too Happy       49.90123    81         38.50617
## 8 Republican  Pretty Happy        50.50790   443         43.70655
## 9 Republican  Very Happy          51.57540   252         46.86508

We can (and should) also give our summaries descriptive names.

gss_bootcamp %>% 
  drop_na(party3, happy_text, newage) %>% 
  group_by(party3, happy_text) %>% 
  summarise(mean_age = mean(newage),
            count = n(),
            mean_occ_prest = mean(PRESTG10))

## `summarise()` has grouped output by 'party3'. You can override using the
## `.groups` argument.

## # A tibble: 9 x 5
## # Groups:   party3 [3]
##   party3      happy_text    mean_age count mean_occ_prest
##   <chr>       <chr>            <dbl> <int>          <dbl>
## 1 Democrat    Not too Happy 48.96512   172       41.44186
## 2 Democrat    Pretty Happy  47.79649   570       43.37544
## 3 Democrat    Very Happy    50.61733   277       46.37545
## 4 Independent Not too Happy 42.92208    77       32.07792
## 5 Independent Pretty Happy  42.824     250       39.376  
## 6 Independent Very Happy    45.71242   153       40.96732
## 7 Republican  Not too Happy 49.90123    81       38.50617
## 8 Republican  Pretty Happy  50.50790   443       43.70655
## 9 Republican  Very Happy    51.57540   252       46.86508

Official Question Time 2

Since the last OQT, we’ve done:

The count() Function
The group_by() Function
The summarize() Function

Three More Quick Functions

`round()`

Oftentimes, we have numbers with a lot of decimals.

x <- 7/3
x

## [1] 2.3333

We can use round() to round those numbers to the desired number of digits after the decimal:

round(x, 2)

## [1] 2.33
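
A few more cases, as a minimal sketch with a made-up number (`y` is hypothetical). The second argument can be zero or negative, and R rounds an exact .5 to the nearest even number rather than always up:

```r
y <- 1234.5678   # hypothetical value

round(y, 2)    # 1234.57
round(y, 0)    # 1235
round(y, -2)   # 1200: negative digits round to tens, hundreds, ...

# An exact .5 goes to the nearest even number
round(2.5)     # 2
round(3.5)     # 4
```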

We can include it in a mutate() function as well:

gss_bootcamp %>% 
  mutate(decimal = 7/3,
         newdecimal = round(decimal,1)) %>% 
  select(decimal, newdecimal) %>% 
  head(10)

## # A tibble: 10 x 2
##     decimal newdecimal
##       <dbl>      <dbl>
##  1 2.333333        2.3
##  2 2.333333        2.3
##  3 2.333333        2.3
##  4 2.333333        2.3
##  5 2.333333        2.3
##  6 2.333333        2.3
##  7 2.333333        2.3
##  8 2.333333        2.3
##  9 2.333333        2.3
## 10 2.333333        2.3

And include it within a summarize() function:

gss_bootcamp %>% 
  drop_na() %>% 
  group_by(race3) %>% 
  summarize(meantv = mean(newtvhours),
            roundmeantv = round(mean(newtvhours),3))

## # A tibble: 3 x 3
##   race3   meantv roundmeantv
##   <chr>    <dbl>       <dbl>
## 1 Black 3.762376       3.762
## 2 Other 2.492537       2.493
## 3 White 2.648188       2.648

`arrange()`

If we want to arrange a dataframe (or tibble) by a certain value, we can use arrange().

gss_bootcamp %>% 
  drop_na() %>% 
  arrange(newage) %>% 
  select(newage)

## # A tibble: 637 x 1
##    newage
##     <int>
##  1     18
##  2     18
##  3     18
##  4     18
##  5     18
##  6     18
##  7     18
##  8     18
##  9     18
## 10     18
## # ... with 627 more rows

gss_bootcamp %>% 
  drop_na() %>% 
  arrange(newpolviews) %>% 
  select(newpolviews)

## # A tibble: 637 x 1
##    newpolviews 
##    <chr>       
##  1 Conservative
##  2 Conservative
##  3 Conservative
##  4 Conservative
##  5 Conservative
##  6 Conservative
##  7 Conservative
##  8 Conservative
##  9 Conservative
## 10 Conservative
## # ... with 627 more rows

Here, it sorted alphabetically by newpolviews. Since newpolviews has the values “Conservative,” “Liberal,” and “Moderate,” the first rows are all “Conservative,” which comes first alphabetically.

`desc()`

arrange() will always sort from smallest to largest for numbers and from A to Z for text. (For factors, it sorts in the order of the levels.)

If we want to reverse this, and sort it in descending order, we can use desc(). We typically use it inside arrange(), like so: arrange(desc(VAR)).

gss_bootcamp %>% 
  drop_na() %>% 
  arrange(desc(newage)) %>% 
  select(newage) %>% 
  head(10)

## # A tibble: 10 x 1
##    newage
##     <int>
##  1     88
##  2     87
##  3     87
##  4     86
##  5     86
##  6     86
##  7     86
##  8     86
##  9     85
## 10     85

We can also use it with text.

gss_bootcamp %>% 
  drop_na() %>% 
  arrange(desc(newpolviews)) %>% 
  select(newpolviews) %>% 
  head(10)

## # A tibble: 10 x 1
##    newpolviews
##    <chr>      
##  1 Moderate   
##  2 Moderate   
##  3 Moderate   
##  4 Moderate   
##  5 Moderate   
##  6 Moderate   
##  7 Moderate   
##  8 Moderate   
##  9 Moderate   
## 10 Moderate

Also, since summarize() will give us results in alpha/numerical order, it can be helpful to arrange(desc()) the results, like this:

gss_bootcamp %>% 
  drop_na() %>% 
  group_by(race3) %>% 
  summarize(meanage = mean(newage)) %>% 
  arrange(desc(race3))   # Arrange descending order: race3

## # A tibble: 3 x 2
##   race3  meanage
##   <chr>    <dbl>
## 1 White 49.24520
## 2 Other 39.91045
## 3 Black 45.12871

gss_bootcamp %>% 
  drop_na() %>% 
  group_by(race3) %>% 
  summarize(meanage = mean(newage)) %>% 
  arrange(desc(meanage))  # Arrange descending order: meanage

## # A tibble: 3 x 2
##   race3  meanage
##   <chr>    <dbl>
## 1 White 49.24520
## 2 Black 45.12871
## 3 Other 39.91045
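
Sorting is also where factors pay off. A minimal sketch with made-up data (`toy` and its columns are hypothetical), comparing the same values stored as text and as a factor:

```r
library(dplyr)

# Hypothetical toy data, not the GSS
toy <- tibble(
  views = c("Moderate", "Liberal", "Conservative", "Liberal")
) %>%
  mutate(views_factor = factor(views,
                               levels = c("Liberal", "Moderate", "Conservative")))

toy %>% arrange(views)                # Conservative, Liberal, Liberal, Moderate
toy %>% arrange(views_factor)         # Liberal, Liberal, Moderate, Conservative
toy %>% arrange(desc(views_factor))   # Conservative, Moderate, Liberal, Liberal
```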

Official Question Time 3

Since the last OQT, we’ve done:

round()
arrange()
desc()

Official Question Time 4

This afternoon, we’ve done:

Factors
Summarizing Data
- count()
- group_by()
- summarize()
round(), arrange(), and desc()

Practice

Create a new GSS dataset “session4”
1. Select the following variables: HEALTH, RACE, COHORT, EDUC, SEX, AFFRMACT, and RANK
  - HEALTH and AFFRMACT should be coded as factors
2. Perform the appropriate data management
3. Drop any rows with missing data
Using “session4”, create the following summary tables
1. Average birth year by somebody’s health
  - Rounded to one decimal place
  - And sorted least to greatest
2. Average social position by views on affirmative action
  - Rounded to two decimal places
3. Count of how many people are in each intersection group of race and sex
  - Sorted from biggest to smallest

Practice Answers

# Part 1
session4 <- GSS %>% 
  
  # Part 1a: Select the following variables: HEALTH, RACE, COHORT, EDUC, SEX, AFFRMACT, and RANK
  select(HEALTH, RACE, COHORT, EDUC, SEX, AFFRMACT, RANK) %>% 
  
  # Part 1b: Perform the appropriate data management
  mutate(
    health = factor(HEALTH,
                    levels = 1:4,
                    labels = c("Excellent", "Good", "Fair", "Poor")),
    race = case_when(
      RACE == 1 ~ "White",
      RACE == 2 ~ "Black",
      RACE == 3 ~ "Other"),
    cohort = case_when(
      COHORT %in% c(-1, 9999) ~ NA_integer_,
      TRUE ~ COHORT
    ),
    education = case_when(
      EDUC %in% c(-1, 98, 99) ~ NA_integer_,
      TRUE ~ EDUC
      ),
    sex = ifelse(SEX == 1, "Male", "Female"),
    affirmative = factor(AFFRMACT,
                         levels = 1:4,
                         labels = c("Strongly Favors", "Not Strongly Favors", 
                                    "Not Strongly Opposes", "Strongly Opposes")),
    socpos = case_when(
      RANK %in% c(0, 98, 99) ~ NA_integer_,
      TRUE ~ RANK 
      )
    ) %>% 
  
  # Part 1c: Drop any rows with missing data
  drop_na()

# Part 2a: Average birth year by somebody's health
session4 %>% 
  group_by(health) %>% 
  # Round to 1 decimal
  summarize(meanyear = round(mean(cohort),1)) %>% 
  # Sort by year 
  arrange(meanyear)

## # A tibble: 4 x 2
##   health    meanyear
##   <fct>        <dbl>
## 1 Poor        1965.1
## 2 Good        1968.9
## 3 Fair        1968.9
## 4 Excellent   1969.3

# Part 2b: Average social position by views on affirmative action
session4 %>% 
  group_by(affirmative) %>% 
  # Rounded to 2 decimal places 
  summarize(social_position = round(mean(socpos),2))

## # A tibble: 4 x 2
##   affirmative          social_position
##   <fct>                          <dbl>
## 1 Strongly Favors                 4.73
## 2 Not Strongly Favors             4.73
## 3 Not Strongly Opposes            4.71
## 4 Strongly Opposes                4.62

# Part 2c: Counts of race X sex
session4 %>% 
  group_by(sex, race) %>% 
  count() %>% 
  # Sorted from most to least 
  arrange(desc(n))

## # A tibble: 6 x 3
## # Groups:   sex, race [6]
##   sex    race      n
##   <chr>  <chr> <int>
## 1 Female White   288
## 2 Male   White   229
## 3 Female Black    70
## 4 Male   Other    46
## 5 Male   Black    44
## 6 Female Other    34

Official Question Time 5

Since the last OQT, we’ve done:

Practice with summarizing data

Final Official Question Time

Yesterday and today, we’ve done:

R Basics
Data Management
- Filtering Rows and Selecting Columns
- Coding and Recoding Data
Shortcuts & Operators

| Name | Symbol | Shortcut |
| --- | --- | --- |
| Assignment Arrow | `<-` | `Alt`/`Option` and `-` |
| Comment | `#` | `Ctrl`/`Cmd` and `Shift` and `c` |
| Pipe | `%>%` | `Ctrl`/`Cmd` and `Shift` and `m` |
| Matching Operator / Percent-in-Percent | `%in%` | |

Three common errors
1. Using one equals sign (=) instead of two (==) for logical tests
2. Forgetting to separate arguments with a comma
3. Forgetting to close a function with )
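
Each of the three looks like this in practice. This is a minimal sketch with made-up values (`age` is hypothetical); the broken versions are commented out so the code still runs:

```r
age <- c(18, 35, 62)   # hypothetical values

# 1. == tests for equality; a single = assigns (like <-)
age == 35              # FALSE TRUE FALSE

# 2. Arguments need commas between them
# round(3.14159 2)     # error: unexpected numeric constant
round(3.14159, 2)      # 3.14

# 3. Every ( needs a matching )
# mean(c(1, 2, 3)      # in the console, R shows a + prompt and waits for the )
mean(c(1, 2, 3))       # 2
```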

Looking Forward

Wow, what an action-packed two days it has been!

As you know, we’ll meet up for Soc 541 (aka Stats 1) on Monday, September 12. (No classes this coming Monday for Labor Day.)

Over the next week or so, you should go to the RStudio Primer page (https://rstudio.cloud/learn/primers) and work through the following primers:

The Basics: https://rstudio.cloud/learn/primers/1
1. Programming Basics
  - Don’t do Visualization Basics - we’ll get to that in a few weeks
Work With Data: https://rstudio.cloud/learn/primers/2
1. Working with Tibbles
2. Isolating Data with dplyr
3. Deriving Information with dplyr

RStudio.Cloud is a website that hosts RStudio (free and paid). It also hosts a number of excellent primers that introduce (or, in this case, review) a number of common tasks.

Each primer has a short video (two minutes or less) followed by some questions and coding exercises.

Don’t worry if you see something we haven’t gotten to yet. The primers are fantastic because they have a solutions button: copy and paste, then see if you understand what’s going on. If it doesn’t click, feel free to skip it and move on.

Primer 2.3 (Deriving Information with dplyr) definitely goes beyond what we’ve covered today and what we’ll need for the class.
If you don’t recognize some parts (especially on “ungrouping”), don’t worry about it.

Session 4: Miscellany, Practice, and Wrap Up

Rutgers University Sociology R Bootcamp

The Home Stretch!!

This Afternoon’s Goals

Loading our Dataset

Factors: Our Last Data Type

Data Type Review

Creating Factor Data

Factor Levels

Factor Labels

Official Question Time 1

Practice with Factors

Practice Examples

Summarizing Data

The `count()` Function

The `group_by()` and `summarize()` Functions

Official Question Time 2

Three More Quick Functions

`round()`

`arrange()`

`desc()`

Official Question Time 3

Wrapping Up

Official Question Time 4

Practice

Practice Answers

Official Question Time 5

Final Official Question Time

Looking Forward
