Fred Traylor, Lab TA (he/him)
September 1, 2022
R
Statistics
Computer with:
R
RStudio
Internet Connection
A copy of these slides
These slides will work with any web browser and operate like PowerPoint.
Next Slide: Click, Right arrow key, Pg Dn, Spacebar, or Swipe Left
Previous Slide: Left arrow key, Pg Up, or Swipe Right
Home for first slide
End for last slide
C brings up Table of Contents (also click bottom left or swipe up)
B makes things Bigger
S makes things Smaller
F toggles slide Footer
On slides with code snippets, you can click the red clipboard button on the right side of each chunk to copy the code in the snippet. You can then paste it into R for quick running.
We’ll be pairing these slides with some work in R.
Please work through the R as we go so you can get practice with the code
Try to type it yourself to build the “muscle memory”
But copy-paste is also welcome
Feel free to ask questions at any time throughout
We will also have “Official Question Times,” often combined with short breaks
These will signal we have finished learning a particular concept for now
If you’ve been bottling up a question, wondering if I would answer it organically, the OQT means I probably won’t
Intro to R
R Basics
Objects
Groups of Objects
Groups of Groups of Objects
Some basic functions
Two keyboard shortcuts
Why learn and use R, especially if you know another statistics program already?
R can handle nearly any data type
Open-source = Free
Anybody can use and add to it (including you!)
New features and functionality added every day!
Large, worldwide community of “useRs”
Similar to other computer languages
R
A programming language
The program we’re currently working with
RStudio
An IDE (Integrated Development Environment)
“A software application that provides comprehensive facilities to computer programmers for software development” (Wikipedia)
In other words: A program that makes it easier to write in and use a programming language
The program we will be working with
How I made these slides
As I mentioned before, R has a very large, very great online community. One of the premier features they’ve put together is a series of cheatsheets. The following cheatsheets will be helpful for you in learning our bootcamp content.
Base R
Data Transformation with dplyr
Here is the link to others if you lose yours and/or just want to look around: https://www.rstudio.com/resources/cheatsheets/
Four Panes
Console (bottom left)
Source (top left)
Environment / History (top right)
Files / Plots / Packages / Help / Viewer (bottom right)
This layout is editable, but it is the default and it works well for most everyone.
(If you don’t see the “Source” pane, click one of the buttons at the top right of the Console pane.)
If you click in the console (bottom left), you can begin typing code into R.
To run it, simply hit Enter, and R will run what you’ve typed.
Let’s get started with that.
At its most basic, R (like all computers) is just a calculator. We can do addition, subtraction, multiplication, division, long functions, complex functions, (nearly) anything.
## [1] 10
## [1] -1
## [1] 90
## [1] 2
## [1] -45180
## [1] -1286
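The expressions that produced the output above are not shown here, so the following is only a sketch. The inputs are assumed, chosen so that they reproduce the first four results.

```r
# Assumed inputs (not from the original slides); type each line and press Enter
8 + 2     # addition: 10
4 - 5     # subtraction: -1
9 * 10    # multiplication: 90
10 / 5    # division: 2
```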
While you can type everything directly into the console pane (bottom left), it is good practice to begin typing your script into the source pane (top left).
Easy to go back, see what you’ve run, change things, and rerun without having to retype everything
Eventually, you can run the entire source code at once
To run a line from the source pane: Press Ctrl + Enter (Cmd + Enter on a Mac) and R will run everything it thinks you want. You can also click the “Run” button in the top right.
If you have something highlighted, R will run ONLY the highlighted code
If you don’t have something highlighted, R will run:
The current line, where your cursor is
Anything after, if your script isn’t finished
Anything before, if it thinks the previous line leads to the current one
Sometimes we want to store numbers so we can reference them again. We can use the assignment arrow <- to assign values to x and y.
## [1] 10
## [1] 17
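As a minimal sketch of the assignment that gives the two values above: x is the sum of 8 and 2, as the slides describe. The expression used for y is an assumption; any calculation that uses x and comes out to 17 would do.

```r
x <- 8 + 2   # store the sum of 8 and 2 in x
x            # print x: 10
y <- x + 7   # assumed expression: reuse x to calculate y
y            # print y: 17
```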
Because we set x equal to the sum of 8 and 2 (10), we can use x later on to calculate y.
Assignment (aka saving something to something else) can be done with both = and <-.
For the remainder of this bootcamp and through this next year, we’ll be using <- to assign.
This is the standard method because it makes it clear which item is being assigned to which.
To make <- quickly, press ALT and - (the minus key) at the same time. On a Mac, press Option and -.
If you look at the top right pane of your RStudio window, you’ll see two “Values” being stored: x, set at 10, and y, which is set to 17.
As we go along, you can always take a look at the environment to see what you have stored and what’s inside.
While it’s nice to use R just as a plain calculator, we almost always have more than one value we’re working with at a time.
We can create “vectors” of these items with the c() command. (c stands for “concatenate.”)
## [1] 2 4 6 8 14
What we did here was create an object called “scores,” which has five values. We then printed it to make sure it was what we wanted. (Not always necessary, but it doesn’t hurt to double-check.)
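A short sketch of that step, using the object name and values shown in the output above:

```r
scores <- c(2, 4, 6, 8, 14)   # combine five values into one vector
scores                        # print it to double-check the contents
```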
When we use a math operator (like addition or division) on a vector, it uses that operator on each piece of the vector. For example:
## [1] 3 6 9 12 21
Above, I multiplied scores by 1.5 and assigned the result to a new object, newscores. I then printed it to show what our new scores are.
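A sketch of the multiplication described here. The vector is rebuilt first so the snippet runs on its own.

```r
scores <- c(2, 4, 6, 8, 14)
newscores <- scores * 1.5   # the operator is applied to every element
newscores                   # 3 6 9 12 21
```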
If we want to pull a specific value from our vector, we can use what’s called “indexing.”
We use square brackets [ ] to do this.
It takes the form: vectorname[element]
For example, if we want the third element of scores, we can index it like so:
## [1] 2 4 6 8 14
## [1] 6
We can also ask for more than one element in a vector at a time like so:
## [1] 4 6
See what we did there? We used a vector (c(2,3)) to index another vector (scores).
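Both kinds of indexing can be sketched as follows, with the vector rebuilt so the snippet is self-contained:

```r
scores <- c(2, 4, 6, 8, 14)
scores[3]         # the third element: 6
scores[c(2, 3)]   # the second and third elements: 4 6
```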
Since we started, we’ve done:
What is R? What is RStudio?
R as a Calculator
Storing Objects
Vectors
Indexing a Vector
We can also use R for things that are not just numbers. Let’s create a vector of words.
Because R reads unquoted words as the names of objects, the above code doesn’t work. Instead, we have to put quotation marks around everything that is text.
## [1] "Oklahoma City" "Dallas" "Charlotte" "Piscataway"
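A sketch of the quoted version that gives the output above. The object name cities is the one used on the following slides.

```r
# Each piece of text goes inside its own pair of quotation marks
cities <- c("Oklahoma City", "Dallas", "Charlotte", "Piscataway")
cities
```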
As an aside, R is great at being able to handle object names that are both very short and very long. There are a few rules R makes you follow for naming:
Names must start with a letter or a period, though it is better to start with a letter.
Names can only contain letters, numbers, underscores, and periods.
You can’t use certain special keywords as names.
There are, however, a few naming conventions. The key one is that names should be as short as possible while still remaining accurate.
Our vector “cities” could have also been called “cities_where_fred_has_lived,” but the name is unnecessarily long, so “cities” works better for now.
If we have vectors of cities for more than one person, it’d be better then to call it fred_cities or cities_fred so it can be distinguished from someone else’s cities (like cities_tom or cities_quan).
Finally, R won’t stop you from overwriting another object, so use caution with common names like “data” or “file” or “vector.”
R can also perform “logical tests” for us. These ask whether two values are equal to each other. For example:
## [1] FALSE
## [1] TRUE
## [1] TRUE
When performing these, we use two equal signs (==) so that R knows what we’re trying to do.
If you use only one (e.g. 1=3), R will give you an error because it thinks you’re trying to assign the value 3 to 1:
Error in 1 = 3 : invalid (do_set) left-hand side to assignment
Common Error #1: Using one equals sign instead of two when performing logical tests
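The tests behind the earlier output are not shown, so here is a sketch with assumed values that gives the same three answers:

```r
# Assumed comparisons (not from the original slides)
1 == 3                 # FALSE
2 == 2                 # TRUE
"Dallas" == "Dallas"   # TRUE: text can be compared too
# 1 = 3                # left commented out: running it raises the assignment error
```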
R can also do logical tests for greater than (>), less than (<), greater than or equal to (>=), and less than or equal to (<=).
## [1] FALSE
## [1] TRUE
Let’s say our scores from earlier are considered passing if they are larger than five. We can create a vector, called passing, that contains whether each score is greater than five.
## [1] 2 4 6 8 14
## [1] FALSE FALSE TRUE TRUE TRUE
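A sketch of that logical test over the whole vector:

```r
scores <- c(2, 4, 6, 8, 14)
passing <- scores > 5   # the test is run on each score in turn
scores
passing                 # FALSE FALSE TRUE TRUE TRUE
```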
We’ve worked with three types of data so far.
In case we forget what type our data is, R has functions built in that can help us remember. Let’s take another look and see what’s inside them.
## [1] 2 4 6 8 14
## [1] "Oklahoma City" "Dallas" "Charlotte" "Piscataway"
## [1] FALSE FALSE TRUE TRUE TRUE
So the first vector (scores) has numbers. The second vector (cities) has text, and the third (passing) has logical values (TRUE and FALSE).
To make sure R has them correct, we can use the class() function on each of them, and R will tell us what kind they are.
## [1] "numeric"
## [1] "character"
## [1] "logical"
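Checking all three vectors can be sketched like this, with the vectors rebuilt so the snippet runs on its own:

```r
scores  <- c(2, 4, 6, 8, 14)
cities  <- c("Oklahoma City", "Dallas", "Charlotte", "Piscataway")
passing <- scores > 5

class(scores)    # "numeric"
class(cities)    # "character"
class(passing)   # "logical"
```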
Since the last OQT, we’ve done:
Text Data
Object naming conventions
Logical tests and logical data
Logical tests over a vector
Data type review
Common Error #1: Using = instead of == when performing logical tests
Beyond the base calculator options, functions are how things get done in R. We’ve already used two functions so far:
c(), to concatenate several values into a vector
class(), to see how R has stored an object
Let’s try out a few more useful functions on the next few slides.
Each function takes the form:
func(argument, argument, ..., etc)
func is the name of the function
The name is immediately followed by an opening parenthesis
Inside the parentheses is a set of “arguments,” and each argument is separated by a comma
The function ends with a closing parenthesis
If you forget to close it, R won’t know the function is over.
Common Error #3: Forgetting to close parentheses
seq()
seq() creates a sequence of numbers. It takes the form: seq(from, to, by). Let’s examine the function by typing ?seq into the console and pressing enter. This will give us the documentation for the sequence function.
You can also open the same page with help(seq). (help(sequence) opens the page for sequence(), which is a different function.)
We can start by creating a sequence of numbers from 3, to 27, counting by 3’s.
## [1] 3 6 9 12 15 18 21 24 27
There are a few other ways we can write the same function, though. We could put the arguments in a different order.
## [1] 3 6 9 12 15 18 21 24 27
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We could also omit the names of the arguments altogether, BUT they have to be in the same order as the original function. (Otherwise, R will either throw an error or quietly return a different sequence.)
## [1] 3 6 9 12 15 18 21 24 27
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
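The three ways of writing the call can be sketched as below. The object names are assumptions made for this example; the arguments from, to, and by are the ones given in the form above.

```r
by_name     <- seq(from = 3, to = 27, by = 3)   # named arguments, usual order
reordered   <- seq(by = 3, to = 27, from = 3)   # named arguments, different order
by_position <- seq(3, 27, 3)                    # unnamed, so order matters

by_name                  # 3 6 9 12 15 18 21 24 27
by_name == reordered     # TRUE for every element
by_name == by_position   # TRUE for every element
```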
length(), head(), and tail()
Let’s create a really long vector using seq().
## [1] 5 17 29 41 53 65 77 89 101 113 125 137 149 161 173
## [16] 185 197 209 221 233 245 257 269 281 293 305 317 329 341 353
## [31] 365 377 389 401 413 425 437 449 461 473 485 497 509 521 533
## [46] 545 557 569 581 593 605 617 629 641 653 665 677 689 701 713
## [61] 725 737 749 761 773 785 797 809 821 833 845 857 869 881 893
## [76] 905 917 929 941 953 965 977 989 1001 1013 1025 1037 1049 1061 1073
## [91] 1085 1097 1109 1121 1133 1145 1157 1169 1181 1193 1205 1217 1229 1241 1253
## [106] 1265 1277 1289 1301 1313 1325 1337 1349 1361 1373 1385 1397 1409 1421 1433
## [121] 1445 1457 1469 1481 1493 1505 1517 1529 1541 1553 1565 1577 1589 1601 1613
## [136] 1625 1637 1649 1661 1673 1685 1697 1709 1721 1733 1745 1757 1769 1781 1793
## [151] 1805 1817 1829 1841 1853 1865 1877 1889 1901 1913 1925 1937 1949 1961 1973
## [166] 1985 1997 2009 2021 2033 2045 2057 2069 2081 2093 2105 2117 2129 2141 2153
## [181] 2165 2177 2189 2201 2213 2225 2237 2249 2261 2273 2285 2297 2309 2321 2333
## [196] 2345 2357 2369 2381 2393 2405 2417 2429 2441 2453 2465 2477 2489 2501 2513
## [211] 2525 2537 2549 2561 2573 2585 2597 2609 2621 2633 2645 2657 2669 2681 2693
## [226] 2705 2717 2729 2741 2753 2765 2777 2789 2801 2813 2825 2837 2849 2861 2873
## [241] 2885 2897 2909 2921 2933 2945 2957 2969 2981 2993 3005 3017 3029 3041 3053
## [256] 3065 3077 3089 3101 3113 3125 3137 3149 3161 3173 3185 3197 3209 3221 3233
## [271] 3245 3257 3269 3281 3293 3305 3317 3329 3341 3353 3365 3377 3389 3401 3413
## [286] 3425 3437 3449 3461 3473 3485 3497 3509 3521 3533 3545 3557 3569 3581 3593
## [301] 3605 3617 3629 3641 3653 3665 3677 3689 3701 3713 3725 3737 3749 3761 3773
## [316] 3785 3797 3809 3821 3833 3845 3857 3869 3881 3893 3905 3917 3929 3941 3953
## [331] 3965 3977 3989 4001 4013 4025 4037 4049 4061 4073 4085 4097 4109 4121 4133
## [346] 4145 4157 4169 4181 4193 4205 4217 4229 4241 4253 4265 4277 4289 4301 4313
## [361] 4325 4337 4349 4361 4373 4385 4397 4409 4421 4433 4445 4457 4469 4481 4493
## [376] 4505 4517 4529 4541 4553 4565 4577 4589 4601 4613 4625 4637 4649 4661 4673
## [391] 4685 4697 4709 4721 4733 4745 4757 4769 4781 4793 4805 4817 4829 4841 4853
## [406] 4865 4877 4889 4901 4913 4925 4937 4949 4961 4973 4985 4997 5009 5021 5033
## [421] 5045 5057 5069 5081 5093 5105 5117 5129 5141 5153 5165 5177 5189 5201 5213
## [436] 5225 5237 5249 5261 5273 5285 5297 5309 5321 5333 5345 5357 5369 5381 5393
## [451] 5405 5417 5429 5441 5453 5465 5477 5489 5501 5513 5525 5537 5549 5561 5573
## [466] 5585 5597 5609 5621 5633 5645 5657 5669 5681 5693 5705 5717 5729 5741 5753
## [481] 5765 5777 5789 5801 5813 5825 5837 5849 5861 5873 5885 5897 5909 5921 5933
## [496] 5945 5957 5969 5981 5993 6005 6017 6029 6041 6053 6065 6077 6089 6101 6113
## [511] 6125 6137 6149 6161 6173 6185 6197 6209 6221 6233 6245 6257 6269 6281 6293
## [526] 6305 6317 6329 6341 6353 6365 6377 6389 6401 6413 6425 6437 6449 6461 6473
## [541] 6485 6497 6509 6521 6533 6545 6557 6569 6581 6593 6605 6617 6629 6641 6653
## [556] 6665 6677 6689 6701 6713 6725 6737 6749 6761 6773 6785 6797 6809 6821 6833
## [571] 6845 6857 6869 6881 6893 6905 6917 6929 6941 6953 6965 6977 6989 7001 7013
## [586] 7025 7037 7049 7061 7073 7085 7097 7109 7121 7133 7145 7157 7169 7181 7193
## [601] 7205 7217 7229 7241 7253 7265 7277 7289 7301 7313 7325 7337 7349 7361 7373
## [616] 7385 7397 7409 7421 7433 7445 7457 7469 7481 7493 7505 7517 7529 7541 7553
## [631] 7565 7577 7589 7601 7613 7625 7637 7649 7661 7673 7685 7697 7709 7721 7733
## [646] 7745 7757 7769 7781 7793 7805 7817 7829 7841 7853 7865 7877 7889 7901 7913
## [661] 7925 7937 7949 7961 7973 7985 7997 8009 8021 8033 8045 8057 8069 8081 8093
## [676] 8105 8117 8129 8141 8153 8165 8177 8189 8201 8213 8225 8237 8249 8261 8273
## [691] 8285 8297 8309 8321 8333 8345 8357 8369 8381 8393 8405 8417 8429 8441 8453
## [706] 8465 8477 8489 8501 8513 8525 8537 8549 8561 8573 8585 8597 8609 8621 8633
## [721] 8645 8657 8669 8681 8693 8705 8717 8729 8741 8753 8765 8777 8789 8801 8813
## [736] 8825 8837 8849 8861 8873 8885 8897 8909 8921 8933 8945 8957 8969 8981 8993
## [751] 9005 9017 9029 9041 9053 9065 9077 9089 9101 9113 9125 9137 9149 9161 9173
## [766] 9185 9197 9209 9221 9233 9245 9257 9269 9281 9293 9305 9317 9329 9341 9353
## [781] 9365 9377 9389 9401 9413 9425 9437 9449 9461 9473 9485 9497 9509 9521 9533
## [796] 9545 9557 9569 9581 9593 9605 9617 9629 9641 9653 9665 9677 9689 9701 9713
## [811] 9725 9737 9749 9761 9773 9785 9797 9809 9821 9833 9845 9857 9869 9881 9893
## [826] 9905 9917 9929 9941 9953 9965 9977 9989
Luckily, R prints the position of the first value in each row in brackets on the left to help us keep track of things, but what if we want the exact number of values? (Say we want to store it as an object for later usage.)
We can use the length() function to see exactly how long the vector is.
## [1] 833
## [1] 833
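The call that built the long vector is not shown, so this is a sketch under assumptions: the name long_vector and the endpoint 9989 are chosen to match the printout above (start at 5, count by 12).

```r
# Assumed name and endpoint, chosen to match the printed values
long_vector <- seq(from = 5, to = 9989, by = 12)

length(long_vector)               # 833
n_values <- length(long_vector)   # store the length for later use
n_values                          # 833
```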
We can also use the head() and tail() functions to get the first or last however many values.
If we look at their documentation (?head or ?tail), we can see they have two arguments: the object and the number of values. We can also see that there is a default of 6 values. This default means that, unless we specify how many we want, it will give us 6 values. (This is helpful if we want a quick look and don’t care how many we get.)
## [1] 5 17 29 41 53 65
## [1] 5 17
If we look at the documentation for tail(), we can see that it just gives us the documentation for head(). This is because they are essentially the same function, just for different ends of the vector.
## [1] 9749 9761 9773 9785 9797 9809 9821 9833 9845 9857 9869 9881 9893 9905 9917
## [16] 9929 9941 9953 9965 9977 9989
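A sketch of the head() and tail() calls that match the output above. The vector name and its endpoint are assumptions, as is the choice of 21 values for tail(), which is inferred from the printout.

```r
long_vector <- seq(from = 5, to = 9989, by = 12)   # assumed name and endpoint

head(long_vector)       # default of 6 values: 5 17 29 41 53 65
head(long_vector, 2)    # just the first two: 5 17
tail(long_vector, 21)   # the last 21 values, from 9749 to 9989
```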
Since the last OQT, we’ve done:
Functions
seq()
length()
head()
tail()
Common Error #2: Forgetting to separate arguments with a comma
Common Error #3: Forgetting to close a function with a parenthesis )
Comments: Ctrl/Cmd + Shift + c
There are two main ways that we can store multiple vectors together: Lists and Data Frames.
We’re going to skip lists for now. They’ll become (minorly) important in a few months, so it’s not worth looking at just yet.
Data frames are R objects that combine vectors both horizontally and vertically. If you’ve worked with any sort of data before, including Excel sheets, data frames will look familiar to you.
Data frames are also how we will be doing the vast majority of our work this year.
Let’s create one using our previous vectors.
## scores newscores
## 1 2 3
## 2 4 6
## 3 6 9
## 4 8 12
## 5 14 21
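Building that data frame can be sketched as follows, using data.frame() with the two earlier vectors:

```r
scores    <- c(2, 4, 6, 8, 14)
newscores <- scores * 1.5

# Each vector becomes a column, named after the vector
firstdf <- data.frame(scores, newscores)
firstdf
```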
We can see that our data frame “firstdf” has names for the columns. Fortunately, they do not take up a row themselves. The numbers on the left also don’t count as a column, but they will be helpful for our purposes later.
There are several very helpful functions for working with data frames.
For example, we can use the ncol() function to see the number of columns there are and nrow() to see the number of rows there are.
If we wanted them together, we could call dim(firstdf), short for dimensions, and it would give us both.
## [1] 2
## [1] 5
## [1] 5 2
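A self-contained sketch of those three calls:

```r
firstdf <- data.frame(scores = c(2, 4, 6, 8, 14), newscores = c(3, 6, 9, 12, 21))

ncol(firstdf)   # number of columns: 2
nrow(firstdf)   # number of rows: 5
dim(firstdf)    # rows first, then columns: 5 2
```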
Lastly, we can also use the function str() to analyze the structure of the data frame. (This works better for small data frames but gets unwieldy very easily when used with larger ones.)
## 'data.frame': 5 obs. of 2 variables:
## $ scores : num 2 4 6 8 14
## $ newscores: num 3 6 9 12 21
This function is neat because it gives us:
The class of the object (“data.frame”)
The dimensions (“5 obs. of 2 variables”)
Summaries of each column
Name
Data type (numeric, logical, character)
And the first few values
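For reference, a sketch of the call itself, with the data frame rebuilt so it runs on its own:

```r
firstdf <- data.frame(scores = c(2, 4, 6, 8, 14), newscores = c(3, 6, 9, 12, 21))

# Prints one header line, then one line per column
str(firstdf)
```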
If we want to extract certain values from firstdf, we can index them. Indexing data frames takes the form dataframe[row,column].
If we index the first row (firstdf[1, ]), for example, we’ll get the first row’s numbers (2, 3).
## scores newscores
## 1 2 3
## [1] 2 4 6 8 14
## [1] 2
The comma is important here. Without it, R will give us just the column, still as a data frame. And if we use double brackets without a comma, it’ll give us the column as a vector.
## scores newscores
## 1 2 3
## 2 4 6
## 3 6 9
## 4 8 12
## 5 14 21
## scores
## 1 2
## 2 4
## 3 6
## 4 8
## 5 14
## [1] 2 4 6 8 14
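The different indexing forms can be sketched side by side. This is an illustration of the behavior described above, not necessarily the exact calls from the original slides.

```r
firstdf <- data.frame(scores = c(2, 4, 6, 8, 14), newscores = c(3, 6, 9, 12, 21))

firstdf[1, ]    # first row, all columns
firstdf[, 1]    # first column, returned as a vector
firstdf[1, 1]   # first row of the first column: 2
firstdf[1]      # no comma: first column, still a data frame
firstdf[[1]]    # double brackets: first column as a vector
```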
Remember how data frames are groups of vectors? Well, we can treat each column as a vector, complete with its own name. Let’s start by seeing what those names are. We can use colnames() or just names() to do so.
## [1] "scores" "newscores"
## [1] "scores" "newscores"
We can also use rownames() to get the names of our rows.
## [1] "1" "2" "3" "4" "5"
Our rows are currently just numbers, so let’s assign them something else.
## [1] "1" "2" "3" "4" "5"
## [1] "karen" "quan" "tom" "kristen" "paul"
See what we did there?
rownames(firstdf) would give us a vector of our row names.
We used the assignment arrow (<-) to assign each of our row names to a new name, found in the vector to the right of it.
Let’s try it again, this time renaming the columns.
## [1] "before_curve" "after_curve"
## before_curve after_curve
## karen 2 3
## quan 4 6
## tom 6 9
## kristen 8 12
## paul 14 21
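Both renaming steps can be sketched together, using the names shown in the output above:

```r
firstdf <- data.frame(scores = c(2, 4, 6, 8, 14), newscores = c(3, 6, 9, 12, 21))

# One new name per row, in order
rownames(firstdf) <- c("karen", "quan", "tom", "kristen", "paul")
# One new name per column, in order
colnames(firstdf) <- c("before_curve", "after_curve")
firstdf
```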
Just like we were able to index rows and columns with their numbers (e.g. firstdf[1,]), we can do the same using their row and/or column names. We can also use the dollar sign ($) to select a column from a data frame (but not a row).
When we index with a column name like this, we don’t need to include a comma. It already knows which column, so all it needs now is a row.
Error in firstdf$before_curve[, 3] : incorrect number of dimensions
## [1] 2 4 6 8 14
## [1] 6
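A sketch of name-based indexing. The last working line, which indexes by row and column name together, is an extra illustration and not taken from the output above.

```r
firstdf <- data.frame(before_curve = c(2, 4, 6, 8, 14),
                      after_curve  = c(3, 6, 9, 12, 21),
                      row.names    = c("karen", "quan", "tom", "kristen", "paul"))

firstdf$before_curve             # the whole column: 2 4 6 8 14
firstdf$before_curve[3]          # third value of that column: 6
firstdf["tom", "before_curve"]   # row name and column name together: 6
# firstdf$before_curve[, 3]      # left commented out: a vector has no second dimension
```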
Because data frames are just groups of vectors, we can use functions on them.
firstdf$column_three <- seq(8, 12) # The number of new values must match the number of rows in the data frame
firstdf$c1_plus_c2 <- firstdf$before_curve + firstdf$after_curve
firstdf
## before_curve after_curve column_three c1_plus_c2
## karen 2 3 8 5
## quan 4 6 9 10
## tom 6 9 10 15
## kristen 8 12 11 20
## paul 14 21 12 35
We can also use head() and tail() with data frames.
## before_curve after_curve column_three c1_plus_c2
## karen 2 3 8 5
## quan 4 6 9 10
## before_curve after_curve column_three c1_plus_c2
## quan 4 6 9 10
## tom 6 9 10 15
## kristen 8 12 11 20
## paul 14 21 12 35
And with columns inside those data frames.
## [1] 8 9 10 11 12
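A sketch of calls that match the three outputs above. The row counts (2 and 4) are inferred from the printout.

```r
firstdf <- data.frame(before_curve = c(2, 4, 6, 8, 14),
                      after_curve  = c(3, 6, 9, 12, 21),
                      row.names    = c("karen", "quan", "tom", "kristen", "paul"))
firstdf$column_three <- seq(8, 12)
firstdf$c1_plus_c2   <- firstdf$before_curve + firstdf$after_curve

head(firstdf, 2)             # first two rows
tail(firstdf, 4)             # last four rows
head(firstdf$column_three)   # a column is a vector, so head() works on it too
```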
Since the last OQT, we’ve done:
Data Frames
Creating
Indexing
Row and Column Length
Row and Column Naming
Functions: head() and tail()
View()
With any data structure, you can use the View() function to see inside. It works fine for vectors, but becomes extremely helpful with data frames. Go ahead and run View(firstdf) now.
Note that the V is capitalized.
You can also click on the data frame in the environment window on the right.
Click on the name and it’ll run View() for you.
Click on the arrow to the left of the name and it’ll give you a short version of str().
You’ll see something like this in the source pane:
table() Function
Another small dataset
For the next few minutes, we’ll be working with a small dataset. Let’s build it below:
smalldata <- data.frame(colors = c("Scarlet", "Black", "Black", "Scarlet", "Scarlet"),
teams = c("Knights", "Hoosiers", "Knights", "Knights", "Hoosiers"))
Let’s take a look at what’s inside:
## colors teams
## 1 Scarlet Knights
## 2 Black Hoosiers
## 3 Black Knights
## 4 Scarlet Knights
## 5 Scarlet Hoosiers
## 'data.frame': 5 obs. of 2 variables:
## $ colors: chr "Scarlet" "Black" "Black" "Scarlet" ...
## $ teams : chr "Knights" "Hoosiers" "Knights" "Knights" ...
So we can see that we’re working with a data frame that has 5 observations of 2 variables. Both variables are character, too.
The table() function
The table() function gives us some quick counts of our data. We can use it when we want to look at how frequently the values of one or two variables appear.
##
## Black Scarlet
## 2 3
##
## Hoosiers Knights
## Black 1 1
## Scarlet 1 2
The first variable listed will always go on the left, and the second variable across the top.
We can see from the dataset itself that “Black” is used twice and “Scarlet” three times. This is reflected in the tables we made as well.
And we can see that Black is on the same line as “Hoosiers” once, so there is a 1 where Black and Hoosiers intersect on the second table. Similarly, there are two rows in the data frame that have both Scarlet and Knights, so the cell where they meet in the second table has a 2.
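The two tables can be sketched as below. Storing them under the names color_counts and color_by_team is an assumption made for this example.

```r
smalldata <- data.frame(colors = c("Scarlet", "Black", "Black", "Scarlet", "Scarlet"),
                        teams  = c("Knights", "Hoosiers", "Knights", "Knights", "Hoosiers"))

color_counts  <- table(smalldata$colors)                    # one variable
color_by_team <- table(smalldata$colors, smalldata$teams)   # rows = colors, columns = teams

color_counts
color_by_team
```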
Since the last OQT, we’ve done:
Viewing data
View()
str()
table()
After we assigned our first two values, I asked you to take a look at the “Environment” pane of your RStudio window (in the top right). Now that we’ve added more since then, take another look. In that pane, you’ll see everything we’ve made so far, separated into “Data” (which includes data frames and lists) and “Values” (which holds vectors and standalone values).
Comments
In your R source code (top left of the window), you can (and should) make comments to help remind you of things. Anything following a number sign/pound sign/hashtag/octothorpe (this thing: #) will not be run.
This is very useful when you’re writing code/script and want to keep a note of something. This is especially helpful if you’re sharing your code with another person (or even Future You), since it lets you label what you did and why.
Throughout this year, you should also add comments onto your homework assignments so that Quan, Tom, and I know what you’re doing. It’s part of showing your work. I’ll provide examples of this when we do big practice sessions later on.
It’s also a great way to keep notes when we do lab sessions in class.
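A small sketch of what commented code can look like. The note text is made up for illustration.

```r
# Curve the scores by 50% (this whole line is a comment and is not run)
scores <- c(2, 4, 6, 8, 14)
newscores <- scores * 1.5   # a comment can also follow code on the same line
# newscores <- scores * 2   # "commented out": R skips this line entirely
newscores
```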
A quick way to “comment out” a whole line or group of lines is to press Ctrl + Shift + c (Cmd + Shift + c on a Mac).
This means we now have learned two R shortcuts:
ALT and - (Option and - on a Mac)
Ctrl + Shift + c (Cmd + Shift + c on a Mac)
(One more to come tomorrow)