In this class, we’ll learn how to use the package
ggplot2 to visualize your data. In addition to this
document, there are a ton of resources to help you learn and understand
ggplot. Some are linked in this document.
library(ggplot2)
library(dplyr)
library(tidyr)
library(RColorBrewer) # used for color-blind friendly palettes
library(palmerpenguins) # used for data in the final comprehension check
# if you want to reproduce the full document, you will also need the fontawesome package. 
Descriptive and inferential statistics are incredibly useful for understanding your data. However, as we have discussed, they are models and models always have error. We have to live with a certain amount of error, but we should make sure that our model is at least useful. For example, is a statement like “the vast majority of humans have more than the mean number of eyes” all that useful?
One way that we can assess the usefulness of our numeric models is to visualize the data.
(Note: For much of the code in this first section, don’t stress if you don’t understand everything the code is doing. Much of this is for demonstration purposes and the parts that we expect you to understand will be explained in more detail.)
Let’s look at the “anscombe” data set that comes with the base R
package datasets.
data("anscombe")
str(anscombe)
## 'data.frame':    11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
head(anscombe)
##   x1 x2 x3 x4   y1   y2    y3   y4
## 1 10 10 10  8 8.04 9.14  7.46 6.58
## 2  8  8  8  8 6.95 8.14  6.77 5.76
## 3 13 13 13  8 7.58 8.74 12.74 7.71
## 4  9  9  9  8 8.81 8.77  7.11 8.84
## 5 11 11 11  8 8.33 9.26  7.81 8.47
## 6 14 14 14  8 9.96 8.10  8.84 7.04
The anscombe data set has eight variables, each with 11 observations.
We’ll use dplyr to get summary statistics. But before,
let’s convert the data into a “long” format so that it’s a bit easier to
work with. We’ll use the tidyr function
pivot_longer. The argument cols selects the
columns that you want to pivot. Since we only have numeric variables, we
can pivot everything to a long format. To do that, we’ll use
cols = everything(). The anscombe data contains 4 sets of
\(x\) and \(y\) data. The argument
names_to and names_pattern work in tandem to
restructure the data.
anscombe <- anscombe |> 
  pivot_longer(cols = everything(), 
                          names_to = c(".value", "set"), 
                          names_pattern = "(.)(.)")
str(anscombe)
## tibble [44 × 3] (S3: tbl_df/tbl/data.frame)
##  $ set: chr [1:44] "1" "2" "3" "4" ...
##  $ x  : num [1:44] 10 10 10 8 8 8 8 8 13 13 ...
##  $ y  : num [1:44] 8.04 9.14 7.46 6.58 6.95 8.14 6.77 5.76 7.58 8.74 ...
head(anscombe)
## # A tibble: 6 × 3
##   set       x     y
##   <chr> <dbl> <dbl>
## 1 1        10  8.04
## 2 2        10  9.14
## 3 3        10  7.46
## 4 4         8  6.58
## 5 1         8  6.95
## 6 2         8  8.14
Now, we’ll use dplyr functions to get summary statistics
for x and y (for each set).
anscombe |> 
  group_by(set) |> 
  summarise(mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y
                      )
            )
## # A tibble: 4 × 5
##   set   mean_x  sd_x mean_y  sd_y
##   <chr>  <dbl> <dbl>  <dbl> <dbl>
## 1 1          9  3.32   7.50  2.03
## 2 2          9  3.32   7.50  2.03
## 3 3          9  3.32   7.5   2.03
## 4 4          9  3.32   7.50  2.03
### NON-DPLYR APPROACH
## you can use the function `tapply` to get summary stats by a grouping in a data set
## e.g., 
# tapply(X = anscombe$y, INDEX = anscombe$set, FUN = mean)
## you would then need to repeat this function, changing the X and FUN arguments 
What do you notice about the four sets?
For all four sets, the mean of x, mean of y, standard deviation of x, and standard deviation of y are the same. So, the models that we have used most in this class, tell us that these sets are pretty much the same!
Let’s go further and fit a linear regression model to each of the four data sets.
fit_by_set <- 
  by(data = anscombe, INDICES = anscombe$set, FUN = function(subset) {
      lm(y ~ x, data = subset)
      }
    )
fit_by_set
## anscombe$set: 1
## 
## Call:
## lm(formula = y ~ x, data = subset)
## 
## Coefficients:
## (Intercept)            x  
##      3.0001       0.5001  
## 
## ------------------------------------------------------------ 
## anscombe$set: 2
## 
## Call:
## lm(formula = y ~ x, data = subset)
## 
## Coefficients:
## (Intercept)            x  
##       3.001        0.500  
## 
## ------------------------------------------------------------ 
## anscombe$set: 3
## 
## Call:
## lm(formula = y ~ x, data = subset)
## 
## Coefficients:
## (Intercept)            x  
##      3.0025       0.4997  
## 
## ------------------------------------------------------------ 
## anscombe$set: 4
## 
## Call:
## lm(formula = y ~ x, data = subset)
## 
## Coefficients:
## (Intercept)            x  
##      3.0017       0.4999
The coefficient estimates are all (basically) the same. All have an intercept around 3 and a slope around 0.5.
TMI (“Too much information!”)
Linear model check
(If you’ve seen other sections of the webpage, you’ll have seen the “TMI” sections. These sections give a little more detail but are not necessary for understanding the broader concepts.)
We can check other measures associated with linear models. In fact, when you fit a model I would suggest checking these values in addition to other diagnostic checks that I won’t show here.
fit_summary_by_set <- lapply(fit_by_set, summary)
Adjusted R-Square:
cat("Adjusted R-Square:\n")
## Adjusted R-Square:
lapply(fit_summary_by_set, FUN = function(i){ i$adj.r.squared })
## $`1`
## [1] 0.6294916
## 
## $`2`
## [1] 0.6291578
## 
## $`3`
## [1] 0.6292489
## 
## $`4`
## [1] 0.6296747
Residual standard error
cat("Residual Standard Error:\n")
## Residual Standard Error:
lapply(fit_summary_by_set, FUN = function(i){ i$sigma })
## $`1`
## [1] 1.236603
## 
## $`2`
## [1] 1.237214
## 
## $`3`
## [1] 1.236311
## 
## $`4`
## [1] 1.235695
The F-Statistic
cat("F-Statistic (with degrees of freedom:\n")
## F-Statistic (with degrees of freedom:
(fstat_by_set <- lapply(fit_summary_by_set, FUN = function(i){ i$fstatistic }))
## $`1`
##    value    numdf    dendf 
## 17.98994  1.00000  9.00000 
## 
## $`2`
##    value    numdf    dendf 
## 17.96565  1.00000  9.00000 
## 
## $`3`
##    value    numdf    dendf 
## 17.97228  1.00000  9.00000 
## 
## $`4`
##    value    numdf    dendf 
## 18.00329  1.00000  9.00000
The f-statistic, along with two degree of freedom values, is used to
calculate the significance of the overall model. We can use the function
pf. (the p is the same p in the
pnorm, punif, pgamma, etc
functions; f stands for the F distribution).
cat("Overall model p-value: ")
## Overall model p-value:
lapply(fstat_by_set, FUN = function(i) {pf(i['value'], i['numdf'], i['dendf'], lower.tail = FALSE)})
## $`1`
##       value 
## 0.002169629 
## 
## $`2`
##       value 
## 0.002178816 
## 
## $`3`
##       value 
## 0.002176305 
## 
## $`4`
##       value 
## 0.002164602
cat("Coefficient values: ")
## Coefficient values:
lapply(fit_summary_by_set, FUN = function(i) {i$coefficients})
## $`1`
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0000909  1.1247468 2.667348 0.025734051
## x           0.5000909  0.1179055 4.241455 0.002169629
## 
## $`2`
##             Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.000909  1.1253024 2.666758 0.025758941
## x           0.500000  0.1179637 4.238590 0.002178816
## 
## $`3`
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0024545  1.1244812 2.670080 0.025619109
## x           0.4997273  0.1178777 4.239372 0.002176305
## 
## $`4`
##              Estimate Std. Error  t value    Pr(>|t|)
## (Intercept) 3.0017273  1.1239211 2.670763 0.025590425
## x           0.4999091  0.1178189 4.243028 0.002164602
So, we’re all done, right? We’ve checked the mean and standard deviation, we’ve fit a linear regression model and then checked the coefficients (as well as the adjusted R-squared value, the standard error of the residuals, the f-statistic, and the significance level for the coefficients).
All of these samples have to come from the population, right? And they have to show the same relationship between variables… Right?
Let’s plot the model with the data. Model outputs and tables are necessary and informative but a good visual makes your findings much more accessible and understandable.
Let’s plot just the regression line for the first set. Most of the prior code is more complicated than you will need for this class ( and it’s complicated because the data set we are using is pretty much 4 data sets that I’ve kept together because it’s more concise to write). But for this example I’ll extract out the data for set 1 so that it is easier to reproduce in your analyses.
set1_data <- anscombe[which(anscombe$set == "1"), ]
# or you could use the Dplyr way: 
#set1_data <- filter(anscombe, set == "1")
head(set1_data)
## # A tibble: 6 × 3
##   set       x     y
##   <chr> <dbl> <dbl>
## 1 1        10  8.04
## 2 1         8  6.95
## 3 1        13  7.58
## 4 1         9  8.81
## 5 1        11  8.33
## 6 1        14  9.96
We’ll refit the model, again so you will have a more reproducible example. The underlying processes are the same as earlier so the results won’t change (at least not drastically)
(fit <- lm(y ~ x, data = set1_data)) # parentheses around an assignment expression allow you to assign and print at the same time
## 
## Call:
## lm(formula = y ~ x, data = set1_data)
## 
## Coefficients:
## (Intercept)            x  
##      3.0001       0.5001
ggplot2 operates by using layers. This gives you the
flexibility to add multiple layers so that you can overlay information
on top of other information. To start, every plot has base layer that
has no overlaid data but that does use your data set to initialize the
axes (the ranges, labels, ticks, etc. can all be modified later).
To create this base layer you should specify the data set (using the
data parameter) and the data within the data set that you
will be mapping to the plot area (using the mapping
parameter). mapping usually takes the function
aes as an argument. “aes” is short for “aesthetic”; this
function takes the variables you want to visualize.
ggplot(data = set1_data, mapping = aes(x = x, y = y))

To add the actual data points, you have to add another layer over the
base layer. Functions that start with geom_ are usually
used to plot the data.
ggplot(data = set1_data, aes(x = x, y = y))+
  geom_point()

The data set that we passed to the data parameter
doesn’t have the information on the model, so we have to add this
information. For a linear model, we are most interested in the
coefficients for the slope and the intercept. We can extract those
estimates from our model fit by using the function
coef.
coef(fit)
## (Intercept)           x 
##   3.0000909   0.5000909
The function coef returns a vector of numeric values.
For this model, where we have a single predictor variable, we expect to
have two coefficients: the intercept and the slope for “x”. So, we
should have a vector of length 2.
length(coef(fit))
## [1] 2
Remember that you can get the nth value in a vector by
using the square brackets and using the square brackets:
vect <- c(54, 32, 94, 12) # vector of length 4
vect[2] # get the 2nd element in the vector
## [1] 32
vect[4] # get the 4th element in the vector
## [1] 12
We can do the same with the coef function:
coef_vector <- coef(fit)
coef_vector[1] # the intercept
## (Intercept) 
##    3.000091
coef_vector[2] # the slope for x
##         x 
## 0.5000909
Because coef() returns a named vector,
we can access its contents by using the actual names of the
coefficients. Generally speaking, this is a much safer option. Let’s go
ahead and store the intercept and slope into separate variables.
(intercept <- coef_vector["(Intercept)"]) # parentheses around an assignment expression will perform assignment and then print the contents of the assigned object
## (Intercept) 
##    3.000091
(slopex <- coef_vector["x"])
##         x 
## 0.5000909
Important
When you are trying to extract a value in a vector, a list, a matrix,
a data frame, etc it can often be risky to use a numeric index. When
analyzing data, you often aren’t necessarily interested in the 3rd
element in a list, or the 6th column in a data frame. You are more
likely to be interested in subject CT004’s score, or the column
“weight”, for instance. If a numeric index is not inherent to the
measure of interest, it’s possible the number could change. You could
add a column in your data set and now weight is no longer the 6th column
but the 7th. If your code has something like df[,6] you may
no longer be analyzing weight, but whatever column was inserted. This is
a type of error that is completely opposite to “fail hard and fail
fast”; these types of errors are really tough to notice and the computer
has no reason to think it’s an error. In the model that we fit above,
what if we added a predictor so that our model was now
y ~ x1 + x? Now coef_vector[2] refers to the
estimate for x1 and not for x.
coef_vector["x"] would refer to the estimate for the
coefficient “x” no matter what other variables are added to the
model.
TMI
You can skip the intermediate step of creating the
coef_vector object by indexing directly on the function
call.
(intercept <- coef(fit)["(Intercept)"])
## (Intercept) 
##    3.000091
(slopex <- coef(fit)["x"])
##         x 
## 0.5000909
ggplot(data = set1_data, aes(x = x, y = y))+
  geom_point()+
  geom_abline(slope = slopex, intercept = intercept)

This looks like a pretty decent fit.
TMI
Note!: We are walking through each line of this plot so that you can
see what each layer/line of code does. By using the +
operator, the lines of code are chained together. If we want to end up
with the plot above, we wouldn’t need to run the code for the previous
plots separately.
Let’s check out the second data set.
Check your understanding
anscombe and filter the
data so that only the rows from set 2 are returned. Assign the filtered
data to a variable called set2_data. Do all of this in the
code chunk below.## WRITE YOUR CODE HERE!
#
#
#
#
#
#
lm
function in a variable called fit.### YOUR CODE HERE
#
#
#
#
#
#
coef to extract the intercept and the
slope from the model fit. You should assign the value for the intercept
to an object named intercept and the coefficient for “x” to
an object named slopex.### YOUR CODE HERE
#
#
#
#
#
#
### YOUR CODE HERE
#
#
#
#
#
#
How does the plot look? Does the regression line look like it fits? How much confidence do you have in the model fit?
Go ahead and plot set 3 and 4. (You can fit a linear model if you want, or you can just use the fit from the last data set. We’ve already seen that the intercepts and slopes are close to the same - at least for our purposes.)
### YOUR CODE HERE
#
#
#
#
#
#
### YOUR CODE HERE
#
#
#
#
#
#
By now, you should have noticed that these data sets are quite different, even though many of our statistical summaries/models suggested that they were the same.
ans_plot <- ggplot(data = anscombe, aes(x = x, y = y)) + # plots can be stored in variables as well
  geom_point() +
  facet_wrap(~set) # a ggplot2 function that allows multi-panel plots
ans_plot

These data look pretty distinct. If we add a regression line (slope and intercept), we see that the lines are the same.
ans_plot +
   geom_smooth(formula = y ~ x, method = "lm", se = FALSE, data = anscombe) # ggplot2 even has some modelling functions; these are good for doing quick exploratory work but I wouldn't rely on them too much

You should always plot your data! Linear regression, the mean, the median, standard deviation, etc are all models and, as such, necessarily simplify your data. Plotting can help you better understand if that simplification is actually useful.
Important
You should always plot your data. However, when you plot your data depends on the type of research you are doing. By that, I mean whether you are generating hypotheses or whether you are testing a hypothesis.
Visualizing your data before deciding on the type of model to fit is necessary if you are doing exploratory work. BUT! If you have a hypothesis that you are testing, you should hold off until after running your hypothesis tests (fitting your models, doing inferential stats, etc). At the very least, you should not adapt your hypothesis after plotting the data.
Why? Every sample is going to have some weirdness associated with it. If you notice an interesting relationship in a sample and you want to test if that relationship can be generalized to the whole population, you can’t test on that original sample. There is no way to know if that relationship was just a quirk of the original data set or if it is, in fact, true of the whole population. In (frequentist) hypothesis testing, you are testing how likely it is that your sample data could have occurred by chance alone. If you plot your data and then modify your hypothesis based on what you see, your hypothesis is no longer independent of that particular sample. So, you can’t assess how likely your data is from chance. This increases the chances of false positives (Type I errors).
For a better explanation, check out Ben Bolker’s answer on CrossValidated to the following question: Performing a statistical test after visualizing data - data dredging?. (Ben Bolker is the author of Ecological modeling and Data in R and is a very reliable source on statistical modeling.)
ggplot2Let’s go back over some ggplot concepts in more details. We’ll use
the starwars data set.
data("starwars")
When you load or read in a data set, remember to do a few checks to make sure everything loaded properly
head(starwars)
## # A tibble: 6 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## #   ³eye_color, ⁴birth_year, ⁵homeworld
The function glimpse in the package dplyr
is really helpful. (Base R str is also good, but I find
that glimpse is a bit easier to read.)
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
## $ films      <list> <"The Empire Strikes Back", "Revenge of the Sith", "Return…
## $ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
## $ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
If you’ve read any of the other documents on the webpage, you’ll remember those last three columns are a bit weird. We don’t really want to have to deal with columns of lists and we don’t need that data any way, so we’ll drop those three columns.
# here we're using dplyr to clean up the data set
starwars <- 
    starwars |>  
    select(-(films:starships))
head(starwars)
## # A tibble: 6 × 11
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 1 more variable: species <chr>, and abbreviated variable names
## #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
glimpse(starwars)
## Rows: 87
## Columns: 11
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
The basic syntax of ggplot2 typically requires an
argument for your data, data = and an aesthetic argument
aes(). Within the aes() argument, you will
specify the data that you want to visualize.
ggplot(data = starwars, aes(x = height, y = mass))

You’ll notice that there are axis labels and the ranges match with our data but the data doesn’t appear in the plot. If you subset the data, you’ll notice that the ranges change (the data still doesn’t show).
sw_human_data <- 
  starwars |>  
  filter(species == "Human")
ggplot(data = sw_human_data, aes(x = height, y = mass))

The data does not appear because ggplot2 does not assume
the type of plot you want, so you must specify this information. You can
do this by using the operator + and functions which usually
start with geom_. For example, for scatter plots use
geom_point(), for boxplots use geom_boxplot(),
for a histogram geom_histogram().
ggplot(data = sw_human_data, aes(x = height, y = mass)) + 
  geom_point()
## Warning: Removed 13 rows containing missing values (geom_point).

ggplot(data = sw_human_data, aes(x = gender, y = height)) +
  geom_boxplot()
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

ggplot(data = sw_human_data, aes(x = height)) + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 4 rows containing non-finite values (stat_bin).

(Note: There are NAs in this data set. ggplot2 will
remove them by default but produces a warning message - “removed n rows
containing non-finite values”; “removed n rows containing missing
values”. Be careful, though, ggplot2 will use the same
warning message if it removes data because of NAs or if it removes them
because the axes are mis-specified. If you get this message make sure
you know why it is being produced.)
These plots are pretty basic and do not have titles or useful axis labels, but ggplot2 gives you tons of flexibility in adjusting your plots. Let’s start with some scatter plots.
First, let’s modify the axis and title labels:
ggplot(data = sw_human_data, aes(x = height, y = mass)) + 
  geom_point() +
  labs(title = "Human height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Removed 13 rows containing missing values (geom_point).

We can also modify how the points appear by adjusting the parameters
in geom_point()
ggplot(data = sw_human_data, aes(x = height, y = mass)) + 
  geom_point(size = 5, shape = 17, color = "red") +
  labs(title = "Human height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Removed 13 rows containing missing values (geom_point).

What if we want to look at how height and mass break down by group?
ggplot(data = sw_human_data, aes(x = height, y = mass, color = eye_color)) + 
  geom_point(size = 5) +
  labs(title = "Human height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Removed 13 rows containing missing values (geom_point).

There are two important things to note about this plot.
First, compare how we specified the point colors in this plot vs the
one preceding it. In the first, color is specified in the
geom_point layer. In the second, it’s specified in the
initial layer, as part of the aes function. It obviously
has a pretty significant effect on the plot. The function
aes maps data onto the aesthetic properties of a
layer/geom. If it is specified in the very first layer, the mapping is
used throughout all of the layers (until and unless it is overridden).
If color is specified outside of aes, color is only defined
for that particular layer. Also, it is more of a property of the geom
rather than the data-geom mapping. Color is not the only property that
works like this. You can specify fill, shape,
size, among others as properties of the geom or the
geom-data mapping.
ggplot(data = sw_human_data, aes(x = height, y = mass, shape = eye_color)) + 
  geom_point(size = 5) +
  labs(title = "Human height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Removed 13 rows containing missing values (geom_point).

ggplot(data = sw_human_data, aes(x = height, y = mass, size = eye_color)) + 
  geom_point() +
  labs(title = "Human height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Using size for a discrete variable is not advised.
## Warning: Removed 13 rows containing missing values (geom_point).

Check your understanding
What do you think would happen if we had left size = 5
in the geom_point() layer? Why do you think that?
The aesthetic mapping can be defined in the initial layer or in a later geom layer. This could be helpful if you have multiple geom layers (e.g., points overlaid over box plots) and you want color to map different in the geom layers.
ggplot(data = sw_human_data, aes(x = height, y = mass)) + 
  geom_point(aes(color = eye_color), size = 5) +
  labs(title = "Human height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Removed 13 rows containing missing values (geom_point).

The second thing you should note about the plot, is that it’s pretty
confusing The colors in the plot don’t match the eye color. While this
probably won’t be a problem you encounter often, specifying colors
manually can be quite useful. Functions which give you more control over
the colors or shapes of points typically start with scale_
(e.g., scale_fill_manual(),
scale_color_continuous(),
scale_shape_discrete()).
R uses hexadecimal to represent colors. You can look up the hex values, but R also has hundreds of built-in color names. You can find those color names here: “http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf”
TMI
Note: No need to worry too much about ‘hexadecimal’. If you are interested, it is a base-16 numerical system. Each color is represented by a hashtag followed by 6 characters; 2 characters representing the amount of red, 2 for green, and 2 for blue.
ggplot(data = sw_human_data, aes(x = height, y = mass, color = eye_color)) + 
  geom_point(size = 5) +
  scale_color_manual(values = c("blue", "lightblue", "tan", "tan4", "forestgreen", "yellow")) +
  labs(title = "Human height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Removed 13 rows containing missing values (geom_point).

You can also change the colors without specifying colors manually.
This is useful if you don’t need to specify exactly which colors go with
with group, but you need a different color palette than the default. By
using a palette you can improve the look of your graphs relatively
easily. Some journals have color preferred color palettes. You can also
make your graphs color-blind friendly. The package
RColorBrewer has created palettes that are compatible with
ggplot2 and contain metadata which let you know if the
palette is color-blind friendly or not.
head(brewer.pal.info, 10)
##          maxcolors category colorblind
## BrBG            11      div       TRUE
## PiYG            11      div       TRUE
## PRGn            11      div       TRUE
## PuOr            11      div       TRUE
## RdBu            11      div       TRUE
## RdGy            11      div      FALSE
## RdYlBu          11      div       TRUE
## RdYlGn          11      div      FALSE
## Spectral        11      div      FALSE
## Accent           8     qual      FALSE
ggplot(data = sw_human_data, aes(x = height, y = mass, color = eye_color)) + 
  geom_point(size = 5) +
  scale_colour_brewer(palette = "RdYlBu") +
  labs(title = "Human height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Removed 13 rows containing missing values (geom_point).

What happens if you have an outlier?
ggplot(data = starwars, aes(x = height, y = mass)) + 
  geom_point(size = 2) +
  scale_colour_brewer(palette = "RdYlBu") +
  labs(title = "Height and mass in the Star Wars universe", x = "Height (cm)", y = "Mass (kg)")
## Warning: Removed 28 rows containing missing values (geom_point).

You can change the axes on your plot. However, be careful, some methods could have unintended consequences. Let’s look at an example:
ggplot(data = starwars, aes(x = height, y = mass)) + 
  geom_point(size = 2) +
  scale_colour_brewer(palette = "RdYlBu") +
  labs(title = "Height and mass in the Star Wars universe", subtitle = "x and y limits changed", x = "Height (cm)", y = "Mass (kg)") +
  ylim(c(0,200)) +
  xlim(c(0,250))
## Warning: Removed 29 rows containing missing values (geom_point).

If we remove the really heavy individual, we can see the height-mass relationship a bit better. We can apply a data smoothing function that will give you a linear regression line with confidence intervals.
ggplot(data = starwars, aes(x = height, y = mass)) + 
  geom_point(size = 2) +
  scale_colour_brewer(palette = "RdYlBu") +
  labs(title = "Height and mass in the Star Wars universe", subtitle = "x and y limits changed", x = "Height (cm)", y = "Mass (kg)") +
  ylim(c(0,200)) +
  xlim(c(50,250)) +
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 29 rows containing non-finite values (stat_smooth).
## Warning: Removed 29 rows containing missing values (geom_point).

However, the model only applies the linear model calculation to the
data points that actually appear. If we use
coord_cartesian() around xlim() and
ylim(), we’ll see a big difference.
ggplot(data = starwars, aes(x = height, y = mass)) + 
  geom_point(size = 2) +
  scale_colour_brewer(palette = "RdYlBu") +
  labs(title = "Height and mass in the Star Wars universe", subtitle = "x and y limits changed using coord_cartesian", x = "Height (cm)", y = "Mass (kg)") +
  coord_cartesian(xlim = c(50, 250), 
                  ylim = c(0, 200)) + 
  stat_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 28 rows containing non-finite values (stat_smooth).
## Warning: Removed 28 rows containing missing values (geom_point).

Notice how the slope and intercept are different in the two plots and
the grey area (the confidence interval) is much larger in the second
plot. The function coord_cartesian() calculates based on
the data in the data frame, not the data that is plotted.
coord_cartesian() basically zooms into the data, while
xlim() and ylim() by themselves subset the
data area. You might not use the geom_smooth() function
much, but you will likely use geom_boxplot().
median_human_height_gender <- tapply(sw_human_data$height, sw_human_data$gender, median, na.rm = TRUE)
ggplot(data = sw_human_data, aes(x = gender, y = height)) +
  geom_boxplot() +
  labs(title = "Human height and gender in the Star Wars universe", x = "Gender", y = "Height (cm)") +
  geom_hline(yintercept = median_human_height_gender['feminine'], linetype = 2)+
  geom_hline(yintercept = median_human_height_gender['masculine'], linetype = 3)+
  annotate("text", x = 2, y = median_human_height_gender['feminine'], label = "median female height", angle = 345, hjust = 1, vjust = 5)+
  annotate("text", x = 1, y = median_human_height_gender['masculine'], label = "median male height", angle = 345, hjust = -0.5, vjust = 0)
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

TMI
annotate()
The function annotate allows you to add elements to your plot that may or may not be part of the original data set. This could be text, as in the prior example, or it could be lines (“segment”), rectangles (“rect”), curves (“curve”), etc. This functionality is quite useful if you want to emphasize certain data points or regions of your plot.
yoda_h <- starwars$height[which(starwars$name == "Yoda")]
yoda_m <- starwars$mass[which(starwars$name == "Yoda")]
wookiee <- starwars |> filter(species == "Wookiee") |> 
  select(height, mass)
skywalkers <- starwars |> filter(grepl("Skywalker|Leia", name))
ggplot(data = starwars, aes(x = height, y = mass)) + 
  geom_point(size = 2) +
  scale_colour_brewer(palette = "RdYlBu") +
  labs(title = "Height and mass in the Star Wars universe", subtitle = "x and y limits changed using coord_cartesian", x = "Height (cm)", y = "Mass (kg)") +
  coord_cartesian(xlim = c(50, 250), 
                  ylim = c(0, 200)) + 
  annotate("curve", 
           x = yoda_h, xend = yoda_h, 
           y = yoda_m + 75, yend = yoda_m, 
           arrow = arrow(),
           color = 'green')+
  annotate("text", x = yoda_h, y = yoda_m + 80, label = "Yoda")+
  annotate("rect", 
           xmin = min(wookiee$height) - 5, xmax = max(wookiee$height)+ 5,
           ymin = min(wookiee$mass) - 5, ymax = max(wookiee$mass) + 5,
           alpha = .5,
           fill = 'darkorange4'
           )+
  annotate("text", 
           x = max(wookiee$height), y = max(wookiee$mass)+10, 
           label = "Wookiees") +
  annotate("rect", 
           xmin = 40, xmax = 80,
           ymin = 190, ymax = 210,
           color = 'black', fill = 'grey',
           alpha = 0.6) +
  annotate("point", x = skywalkers$height, y = skywalkers$mass,
           size = 4, color = 'blue', alpha = 0.5)+
  annotate("text", x = 50, y = 200, label = "Skywalkers", hjust = 0)+
  annotate("point", x = 45, y = 200, size = 2, color = 'black')+
  annotate("point", x = 45, y = 200, size = 4, color = 'blue', alpha = 0.5) +
  theme_classic() # using pre-defined themes is a quick and simple way to improve the look of your plots
## Warning: Removed 28 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_point).

ggplot(data = sw_human_data, aes(x = gender, y = height)) +
  geom_boxplot() +
  labs(title = "Human height and gender in the Star Wars universe", subtitle = "x and y limits changed", 
       x = "Gender", y = "Height (cm)") +
  geom_hline(yintercept = median_human_height_gender['feminine'], 
             linetype = 2)+
  geom_hline(yintercept = median_human_height_gender['masculine'], 
             linetype = 3)+
  annotate("text", x = 2, y = median_human_height_gender['feminine'], 
           label = "median female height", 
           angle = 345, hjust = 1, vjust = 5)+
  annotate("text", x = 1, y = median_human_height_gender['masculine'], 
           label = "median male height", angle = 345, hjust = -0.5, vjust = 0)+
  ylim(c(150, 190))
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).

ggplot(data = sw_human_data, aes(x = gender, y = height)) +
  geom_boxplot() +
  labs(title = "Human height and gender in the Star Wars universe", subtitle = "x and y limits changed using coord_cartesian", x = "Gender", y = "Height (cm)")  +
  geom_hline(yintercept = median_human_height_gender['feminine'], 
             linetype = 2)+
  geom_hline(yintercept = median_human_height_gender['masculine'], 
             linetype = 3) +
  annotate("text", x = 2, y = median_human_height_gender['feminine'], 
           label = "median female height", angle = 345, hjust = 1, vjust = 5)+
  annotate("text", x = 1, y = median_human_height_gender['masculine'], 
           label = "median male height", angle = 345, hjust = -0.5, vjust = 0)+
  coord_cartesian(ylim = c(150, 190))
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

When we manually set the y axis limits without using
coord_cartesian, the median height of masculine characters
differs by about 3 cm.
ggplot(data = sw_human_data, aes(x = gender, y = height)) +
  geom_boxplot() +
  labs(title = "Human height by gender in the Star Wars universe",x = "Gender", y = "Height (cm)")
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).

One of the benefits of ggplot2 is that you can easily (once you get used to the ggplot2 syntax) combine different types of visualizations. You can overlay a violin plot with box plots while plotting the individual data points all in one plot
ggplot(data = sw_human_data, aes(x = gender, y = height)) +
  geom_violin() +
  geom_boxplot(width = 0.25) +
  geom_jitter(size = 3, width = 0.25) + # geom_jitter in geom_point() but with the points "jittered" 
  labs(title = "Human height by gender in the Star Wars universe",x = "Gender", y = "Height (cm)")
## Warning: Removed 4 rows containing non-finite values (stat_ydensity).
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing missing values (geom_point).

You can continue to add more groups or variables for more complex plots.
ggplot(data = sw_human_data, aes(x = gender, y = height)) +
  geom_violin() +
  geom_boxplot(width = 0.25) +
  geom_jitter(size = 3, width = 0.25, aes(color = homeworld)) + # geom_jitter in geom_point() but with the points "jittered" 
  labs(title = "Human height by gender in the Star Wars universe",x = "Gender", y = "Height (cm)")
## Warning: Removed 4 rows containing non-finite values (stat_ydensity).
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing missing values (geom_point).

ggplot(data = sw_human_data, aes(x = gender, y = height)) +
  geom_violin() +
  geom_boxplot(width = 0.25) +
  geom_jitter(size = 3, width = 0.25, aes(color = mass)) + # geom_jitter in geom_point() but with the points "jittered" 
  labs(title = "Human height by gender in the Star Wars universe",x = "Gender", y = "Height (cm)")
## Warning: Removed 4 rows containing non-finite values (stat_ydensity).
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing missing values (geom_point).

ggplot(data = sw_human_data, aes(x = height)) + 
  geom_histogram(binwidth = 10) +
  labs(title = "Human height histogram in the Star Wars universe",x =  "Height (cm)")
## Warning: Removed 4 rows containing non-finite values (stat_bin).

Different binwidths have an effect on the shape of the histogram.
ggplot(data = sw_human_data, aes(x = height)) + 
  geom_histogram(binwidth = 10) +
  geom_histogram(binwidth = 3, fill = "red") +
  labs(title = "Human height histogram in the Star Wars universe",x =  "Height (cm)")
## Warning: Removed 4 rows containing non-finite values (stat_bin).
## Removed 4 rows containing non-finite values (stat_bin).

sw_nonhuman_data <-
starwars %>%
  mutate(human = ifelse(species == "Human", "Human", "Non-Human"))
  
ggplot(data = sw_nonhuman_data, aes(x = height, fill = human)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Height histogram in the Star Wars universe", subtitle = "Human vs Non-Human", x =  "Height (cm)")
## Warning: Removed 6 rows containing non-finite values (stat_bin).

In this final plot we’ll connect points and we’ll deal with object type problems.
We’ll simulate data so that we have multiple measurements for an individual.
data <-
data.frame(ID = rep(c("a", "b", "c", "d"), times = 2),
           week = rep(c(1, 2), each = 4),
           measure = c(rnorm(4, 5, 2), 
                       rnorm(4, 10, 2)))
glimpse(data)
## Rows: 8
## Columns: 3
## $ ID      <chr> "a", "b", "c", "d", "a", "b", "c", "d"
## $ week    <dbl> 1, 1, 1, 1, 2, 2, 2, 2
## $ measure <dbl> 5.151549, 2.834430, 4.809382, 5.461834, 7.981721, 10.447433, 1…
First, let’s plot the data to see if there is a difference between weeks.
ggplot(data = data, aes(x = week, y = measure)) +
  geom_boxplot() +
  geom_point()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

That seems wrong. What happened?
R interprets “week” as a continuous variable. We want to treat it as a factor.
ggplot(data = data, aes(x = as.factor(week), y = measure)) +
  geom_boxplot() +
  geom_point()

That’s better!
Now, let’s connect the points by subject ID.
ggplot(data = data, aes(x = as.factor(week), y = measure)) +
  geom_boxplot() +
  geom_point() +
  geom_path(aes(group = ID))

Check your understanding
I have replicated an infamous plot published in a Business Insider article using data from the Florida Department of Law Enforcement. The article discusses a law passed in Florida that relates to the use of deadly force in self-defense situations. “Stand-your-ground” laws are somewhat controversial so if you’re interested in the topic, there’s a lot out there. For this lesson, however, we’re really only interested in the details of how the data is visualized.
dat <- read.csv("../data/FL_Murder_by_Weapon_1995-2020.csv")
head(dat)
##   Year Total.Offenses Firearm Knife..Cutting.Instrument Hands..Fist..Feet Other
## 1 1995           1030     687                       135                50   158
## 2 1996           1077     668                       146                85   178
## 3 1997           1014     634                       129                87   164
## 4 1998            966     589                       115                86   176
## 5 1999            856     460                       132                98   166
## 6 2000            890     499                       131                83   177
glimpse(dat)
## Rows: 26
## Columns: 6
## $ Year                      <int> 1995, 1996, 1997, 1998, 1999, 2000, 2001, 20…
## $ Total.Offenses            <int> 1030, 1077, 1014, 966, 856, 890, 867, 906, 9…
## $ Firearm                   <int> 687, 668, 634, 589, 460, 499, 502, 552, 586,…
## $ Knife..Cutting.Instrument <int> 135, 146, 129, 115, 132, 131, 108, 114, 121,…
## $ Hands..Fist..Feet         <int> 50, 85, 87, 86, 98, 83, 91, 82, 77, 94, 75, …
## $ Other                     <int> 158, 178, 164, 176, 166, 177, 166, 158, 140,…
ggplot(data = dat, aes(x = Year, y = Firearm))+
  geom_area(fill = 'darkred', alpha = 0.8)+
  geom_line() +
  scale_y_reverse(limits = c(1200, 0), 
                  breaks = seq(1200, 0, -200), 
                  labels = seq(1200, 0, -200))+
  scale_x_continuous(breaks = c(2000, 2010, 2020), labels = c("1990s", "2000s", "2010s"))+
  ylab(label = element_blank())+
  xlab(label = element_blank())+
  labs(caption = "Recreation of a plot published in a Business Insider article from FEB 18, 2014.\nData is from Florida Department of Law Enforcement.\nOriginal plot showed data from 1990-2012, current plot shows data from 1995-2020.")+
  ggtitle("Gun deaths in Florida", 
          subtitle = "Number of murders committed using firearms")+
  annotate(geom = "text", 
           x = dat$Year[1], y = dat$Firearm[1], 
           vjust = 2,
           label = as.character(dat$Firearm[1]))+
  annotate(geom = "text", 
           x = tail(dat$Year, n = 1), y = tail(dat$Firearm, n = 1), 
           vjust = 2,
           label = as.character(tail(dat$Firearm, n = 1)))+
  annotate(geom = "segment",
           x = 2005, xend = 2005,
           y = dat$Firearm[which(dat$Year == 2005)], 
           yend = dat$Firearm[which(dat$Year == 2005)] - 100,
           color = 'white'
           )+
  annotate(geom = "text",
           x = 2005,
           y = dat$Firearm[which(dat$Year == 2005)] - 250,
           label = "2005\nFlorida enacted \nits 'Stand Your \nGround' law",
           color = 'white',
           hjust = 0,
           )+
  geom_point(size = 4, color = 'white', fill = 'black', shape = 21)+
  theme(
    panel.background = element_rect(fill = "white", color = "white"),
    plot.title = element_text(size = 24, face = "bold"),
    axis.text.x = element_text(hjust = 3, vjust = 5),
    axis.ticks.length.x = unit(-0.4, "cm"),
    panel.grid.major.y = element_line(color = "grey"),
    panel.grid.major.x = element_blank(),
    plot.caption = element_text(hjust = 0)
    )

Before looking in detail at the code or the plot, what are your initial impressions of the plot? What does it do well? What could be done better? What does the plot tell you?
This plot makes (at least) one pretty big mistake. What is it? Why is it a mistake?
An important step in writing code is learning to read code. Take
a look at the code above. While it may be quite daunting, you don’t need
to understand everything to be able to gain crucial information from the
code. What does the line geom_line() do? Remember that you
can check the documentation of a function by using ? before
the function name. (Google is also your friend in this case as there are
tons of tutorials, cheat
sheets, and vignettes
that can explain things in more detail.)
You should always read the documentation, but not all documentation is created equal. Writing documentation is not easy and many programmers often treat it as an afterthought. But, you can approach coding like a scientist!
Form a hypothesis: What do you think the function does?
Design an experiment: How can you change the code to see if the
function acts the way you think it does?
Run an experiment: Run the code/program
Analyze the results: Is the output what you expected? If not…
Update your hypothesis and/or experimental design and try again!
What does the function scale_x_continuous do,
specifically, what does it do in this plot? What about the parameters
breaks and labels? Formulate a “hypothesis”
and design an “experiment” to test the function
scale_x_continuous.
We’re going to go back to the starwars data set. We’re
going to look at heights again; this time we’re going to compare heights
across the different home worlds in the star wars universe.
Unfortunately, a lot of the home worlds in the data set only have one or two characters. We’ll first filter the data so that we drop all of the home worlds that have two or fewer characters.
worlds <- starwars |>
  group_by(homeworld) |> 
  filter(n() >= 3) |> 
  drop_na(homeworld) # drops the observations where the home world is NA
head(worlds)
## # A tibble: 6 × 11
## # Groups:   homeworld [3]
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 1 more variable: species <chr>, and abbreviated variable names
## #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
ggplot(data = worlds, aes(x = homeworld, height))+
  geom_boxplot() +   
  labs(title = "Height by home world in the Star Wars universe",
       y =  "Height (cm)",
       x = "Home world")+
  theme_bw() 

Do you feel like box plots are a good representation of these data? Why or why not?
There is at least one aspect of the data that this plot does not represent well. Add one or more layers to the plot to make the visualization less misleading. (You can directly copy the code from above. You can remove layers if you want, but you don’t have to.)
### YOUR CODE HERE
#
#
#
#
#
For this example, we’re going to look at penguin data. These data can
be found in the R package palmerpenguin.
data("penguins")
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## 6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
## # … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
penguins <- penguins |> na.omit()
glimpse(penguins)
## Rows: 333
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
## $ sex               <fct> male, female, female, female, male, female, male, fe…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
ggplot(data = penguins, aes(x = bill_length_mm, y = bill_depth_mm))+
  geom_point() +
   labs(x="Length (mm)", y="Depth (mm)", title="Penguin Bill Depth by Bill Length")+
  theme_bw()

Do you think there is a relationship between Penguin bill length and bill depth? If so, do you think it’s a positive or negative relationship based on the plot?
Take a look at the rest of the columns in the data set.
colnames(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
Do you think there are any other variables that might help you better understand the relationship between bill length and bill depth?
### YOUR CODE HERE
#
#
#
#
#
ggplot2 gives you a lot of power and flexibility to make
really nice visualizations. It can’t do everything though. There are
packages, however, that build on ggplot2. You can even
create interactive 3D plots and animations. It does take a bit of time
to get used to the syntax and you will spend a lot of time on Google
(even for things you’ve done multiple times). But once you get familiar
with it, you’ll have a lot of fun creating beautiful plots!
================================================================================
Last update on 2022-01-03
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_AT.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] palmerpenguins_0.1.1 RColorBrewer_1.1-3   tidyr_1.2.0         
## [4] dplyr_1.0.10         ggplot2_3.3.6       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2  xfun_0.32         bslib_0.4.0       purrr_0.3.4      
##  [5] splines_4.2.2     lattice_0.20-45   colorspace_2.0-3  vctrs_0.4.1      
##  [9] generics_0.1.3    htmltools_0.5.3   yaml_2.3.5        mgcv_1.8-41      
## [13] utf8_1.2.2        rlang_1.0.5       jquerylib_0.1.4   pillar_1.8.1     
## [17] glue_1.6.2        withr_2.5.0       DBI_1.1.2         lifecycle_1.0.1  
## [21] stringr_1.4.1     munsell_0.5.0     gtable_0.3.0      fontawesome_0.4.0
## [25] evaluate_0.16     labeling_0.4.2    knitr_1.40        fastmap_1.1.0    
## [29] fansi_1.0.3       highr_0.9         scales_1.2.0      cachem_1.0.6     
## [33] jsonlite_1.8.0    farver_2.1.0      digest_0.6.29     stringi_1.7.8    
## [37] grid_4.2.2        cli_3.3.0         tools_4.2.2       magrittr_2.0.3   
## [41] sass_0.4.2        tibble_3.1.8      pkgconfig_2.0.3   Matrix_1.4-1     
## [45] ellipsis_0.3.2    assertthat_0.2.1  rmarkdown_2.16    rstudioapi_0.14  
## [49] R6_2.5.1          nlme_3.1-160      compiler_4.2.2
================================================================================
Copyright © 2022 Dan C. Mann. All rights reserved.