In the last lesson, we covered how to use read.table() and read.csv()
to import your data into R. Even if you can get your data into R, you
are likely to encounter problems. Many of these problems can be solved
by making sure that your variables are in the correct format, but
plenty of other issues can arise that may not be obvious at first
glance. You can save yourself a lot of headaches by formatting your
data properly from the outset. Follow these rules to make your life
easier:
Each variable has its own column
The first row contains variable names
Every variable has a name
Each observation has its own row
Each value has its own cell
Don’t leave blank cells
In R, NA is the default value for missing data
Be consistent
Values that are the same should be entered the same (e.g., don’t mix Y and Yes, or use more than one spelling of Austria)
Be consistent with variable names (e.g., don’t mix styles such as pedal length and Sepal_Width)
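As a rough illustration of the last two rules, the sketch below checks a small made-up data frame for inconsistent entries and blank cells. The survey data frame and its column names are hypothetical and exist only for this example.

```r
# Hypothetical survey data with the kinds of problems described above
survey <- data.frame(
  country = c("Austria", "austria", "Germany", ""),
  smoker  = c("Y", "Yes", "N", "No"),
  stringsAsFactors = FALSE
)

# unique() exposes values that should have been entered the same way
unique(survey$smoker)

# Recode the variants to a single spelling
survey$smoker[survey$smoker == "Y"] <- "Yes"
survey$smoker[survey$smoker == "N"] <- "No"
survey$country[survey$country == "austria"] <- "Austria"

# Turn blank cells into NA so R treats them as missing
survey$country[survey$country == ""] <- NA

# Count the missing values in each column
colSums(is.na(survey))
```

Running unique() on each text column right after importing is a cheap way to spot this kind of problem before it affects an analysis.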
When you install R, you also install a package that contains multiple
data sets. These data sets can be used for practice, and you will often
see tutorials use one or more of them. In RStudio, if you run the
function data() with no arguments, a new tab will open in your Source
pane.
If you look at the tab, at the top you’ll see “Data sets in package ‘datasets’:” This is a “base” R package. Base R packages come with the installation of R and often are essential to the basic functioning of R as a language. Take a look at the available data sets and their descriptions. We’ll load one now.
There are a few data sets that you will see very often in tutorials.
Among the most popular are mtcars, iris, ToothGrowth, and USArrests. To
find out more about a data set, type ? followed by the name of the data
set in the console (e.g., ?mtcars). This will bring up the
documentation, which includes a description of the data.
To load one of these data sets, use the same data() function and put
the name of the data set as the first argument. The name of the data
set can be given with or without quotes: data("iris") or data(iris).
data(iris)
## You may see <promise> in your R environment window. This is normal; once you run a line of code that uses the data, R will load it.
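If you would rather look up the available data sets in code than in the Source pane, the following sketch is one way to do it. It relies on the results matrix returned by data(); the variable name available is just a placeholder.

```r
# data() returns an object whose `results` matrix has one row per data set
available <- data(package = "datasets")$results

# Show the name and title of the first few data sets
head(available[, c("Item", "Title")])

# Check that the data sets mentioned above are included
c("mtcars", "iris", "ToothGrowth", "USArrests") %in% available[, "Item"]
```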
When you load data, whether one of R’s pre-installed data sets or your
own, you should always check it to verify that what you loaded is what
you expected. The functions str() and head() are useful for this.
str() prints the structure of the data. Here you can see the number and
names of columns (aka variables), the number of observations, and the
data type of each column.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
In this data set, we have 150 observations taken from three species of flowers. For each individual flower there are four measurements.
head() prints the first 6 rows (observations) of the data frame by
default; you can change the number of rows shown by specifying the n
argument.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
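A short sketch of the n argument, together with a few base R functions that report the same facts as str() one at a time. These extra functions are not covered above and are shown only as a cross-check.

```r
# Show only the first 3 rows instead of the default 6
head(iris, n = 3)

# Cross-check the numbers reported by str()
nrow(iris)           # number of observations
ncol(iris)           # number of variables
names(iris)          # column names
sapply(iris, class)  # data type of each column
```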
The rm() function can help you clean up your Environment pane.
rm(variable_name) will remove the variable from your global
environment. It works the same way for Data, Values, or Functions. To
remove everything from your environment, use rm(list = ls()). The ls()
function returns the names of all of your current variables, so you’re
giving rm() a full list of your variable names. Let’s remove the iris
data set from our (current) environment.
rm(iris)
You shouldn’t see iris under “Data” in your global environment pane
anymore. If you accidentally removed the data set, you can still reload
iris using data(), but any changes you made to the data set will be
lost. Similarly, if you create a variable and then remove it with rm(),
the variable is gone, which is why it is important to work in scripts
and save often.
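The behaviour described above can be tried out with a couple of throwaway variables. The names scratch_a and scratch_b are made up for this sketch.

```r
# Create two throwaway variables (hypothetical names)
scratch_a <- 1:10
scratch_b <- "temporary"

# ls() returns the variable names as a character vector
ls()

# Remove a single variable and confirm that it is gone
rm(scratch_a)
exists("scratch_a")

# Load a built-in data set, change it, then remove it
data(iris)
iris$Sepal.Length <- iris$Sepal.Length * 10
rm(iris)

# Reloading brings back the original values, not the modified ones
data(iris)
head(iris$Sepal.Length)
```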
Now that we have our data in R, let’s see what we can find out! For
this, we’ll use the package dplyr.
library(dplyr)
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## filter, lag
## The following objects are masked from 'package:base':
## intersect, setdiff, setequal, union
## dplyr uses objects called "tibbles". These are essentially dataframes.
Many, if not most, R packages include data sets that can be used for
practicing. Run the data() function again.
You’ll see the same information as before, but if you scroll down, you’ll see “Data sets in package ‘dplyr’”. When you load a package, you also load the package’s data sets. Let’s load the starwars data set.
data(starwars)
head(starwars)
## # A tibble: 6 × 14
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
## 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 5 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 6 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## # ³eye_color, ⁴birth_year, ⁵homeworld
str(starwars[1:11]) # print only the first 11 columns; the last three columns are lists
## tibble [87 × 11] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
## $ height : int [1:87] 172 167 96 202 150 178 165 97 183 182 ...
## $ mass : num [1:87] 77 75 32 136 49 120 75 32 84 77 ...
## $ hair_color: chr [1:87] "blond" NA NA "none" ...
## $ skin_color: chr [1:87] "fair" "gold" "white, blue" "white" ...
## $ eye_color : chr [1:87] "blue" "yellow" "red" "yellow" ...
## $ birth_year: num [1:87] 19 112 33 41.9 19 52 47 NA 24 57 ...
## $ sex : chr [1:87] "male" "none" "none" "male" ...
## $ gender : chr [1:87] "masculine" "masculine" "masculine" "masculine" ...
## $ homeworld : chr [1:87] "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
## $ species : chr [1:87] "Human" "Droid" "Droid" "Human" ...
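Because tibbles are essentially data frames, the usual data frame tools keep working on them. The sketch below checks this and shows glimpse(), which dplyr provides as a compact alternative to str(); sw_df is a placeholder name.

```r
library(dplyr)

# A tibble is still a data frame, so data frame tools keep working
is.data.frame(starwars)
class(starwars)

# glimpse() prints one line per column, similar to str()
glimpse(starwars)

# as.data.frame() converts a tibble to a plain data frame if needed
sw_df <- as.data.frame(starwars[1:11])
class(sw_df)
```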
At first dplyr may seem a little strange; however, as you get familiar
with it you will notice that it’s pretty intuitive and makes data
organization a lot easier. dplyr allows you to filter, rearrange,
modify, and summarize your data quickly and (relatively) painlessly. We
will learn the following verbs: select(), filter(), summarise(),
group_by(), and mutate().
The great thing about these verbs is that they can be combined with the
pipe operator %>% so that you can perform multiple operations in a
single chain.
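To see what the pipe does before applying it to the dplyr verbs, here is a minimal sketch using two base R functions. The pipe passes the value on its left as the first argument of the function on its right.

```r
library(dplyr)  # loading dplyr makes %>% available

x <- c(1, 4, 9)

# Without the pipe, calls are nested and read from the inside out
nested <- sum(sqrt(x))

# With the pipe, the same steps read from left to right
piped <- x %>% sqrt() %>% sum()

# Both approaches give the same result
identical(nested, piped)

# Extra arguments go inside the parentheses as usual
x %>% head(n = 2)
```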
select() allows you to choose a column or multiple columns. The first
argument (if you don’t use %>%) should be the data frame, the second
should be the column name. You don’t need to put quotes (“”) around the
column name in select().
select(starwars, name)
## # A tibble: 87 × 1
## name
## <chr>
## 1 Luke Skywalker
## 2 C-3PO
## 3 R2-D2
## 4 Darth Vader
## 5 Leia Organa
## 6 Owen Lars
## 7 Beru Whitesun lars
## 8 R5-D4
## 9 Biggs Darklighter
## 10 Obi-Wan Kenobi
## # … with 77 more rows
You can pass as many column names as you like to select multiple
columns (and you don’t need to use c()).
select(starwars, name, species)
## # A tibble: 87 × 2
## name species
## <chr> <chr>
## 1 Luke Skywalker Human
## 2 C-3PO Droid
## 3 R2-D2 Droid
## 4 Darth Vader Human
## 5 Leia Organa Human
## 6 Owen Lars Human
## 7 Beru Whitesun lars Human
## 8 R5-D4 Droid
## 9 Biggs Darklighter Human
## 10 Obi-Wan Kenobi Human
## # … with 77 more rows
We can also write the code using the pipe operator %>%. For simple
examples you don’t need to use it, but as your code gets more complex
the pipe operator will make your code easier to understand. If you use
%>%, then the first argument that you explicitly write is the first
column name you want to select.
starwars %>%
  select(name, species)
## # A tibble: 87 × 2
## name species
## <chr> <chr>
## 1 Luke Skywalker Human
## 2 C-3PO Droid
## 3 R2-D2 Droid
## 4 Darth Vader Human
## 5 Leia Organa Human
## 6 Owen Lars Human
## 7 Beru Whitesun lars Human
## 8 R5-D4 Droid
## 9 Biggs Darklighter Human
## 10 Obi-Wan Kenobi Human
## # … with 77 more rows
The columns in the output appear in the order you give them to
select(), not in the original order of the data frame.
starwars %>%
  select(species, name)
## # A tibble: 87 × 2
## species name
## <chr> <chr>
## 1 Human Luke Skywalker
## 2 Droid C-3PO
## 3 Droid R2-D2
## 4 Human Darth Vader
## 5 Human Leia Organa
## 6 Human Owen Lars
## 7 Human Beru Whitesun lars
## 8 Droid R5-D4
## 9 Human Biggs Darklighter
## 10 Human Obi-Wan Kenobi
## # … with 77 more rows
You can get fancy by using helper functions within select(), like
starts_with() or ends_with().
starwars %>%
  select(ends_with("color"))
## # A tibble: 87 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 <NA> gold yellow
## 3 <NA> white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
## 7 brown light blue
## 8 <NA> white, red red
## 9 black light brown
## 10 auburn, white fair blue-gray
## # … with 77 more rows
Or by using select_if().
starwars %>%
  select_if(is.numeric)
## # A tibble: 87 × 3
## height mass birth_year
## <int> <dbl> <dbl>
## 1 172 77 19
## 2 167 75 112
## 3 96 32 33
## 4 202 136 41.9
## 5 150 49 19
## 6 178 120 52
## 7 165 75 47
## 8 97 32 NA
## 9 183 84 24
## 10 182 77 57
## # … with 77 more rows
To drop a column use - and to select a range of columns use :.
starwars %>%
  select(name:mass)
## # A tibble: 87 × 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## # … with 77 more rows
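The output above shows a range selection. The sketch below adds examples of dropping columns with -, and of where(), which is an alternative to select_if() in dplyr 1.0.0 and later (an assumption about your installed version). names() is used only to keep the printed output short.

```r
library(dplyr)

# Drop a single column with -
starwars %>%
  select(-name) %>%
  names()

# Drop a range of columns by combining - and :
starwars %>%
  select(-(hair_color:starships)) %>%
  names()

# where() selects columns by type, like select_if()
starwars %>%
  select(where(is.numeric)) %>%
  names()
```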
To store the results of the selection, assign the output to a new variable, or to the same variable if you want to overwrite it. We’ll drop the three final columns and store the result under the same variable name.
starwars <-
  starwars %>%
  select(-films, -vehicles, -starships)
str(starwars)
## tibble [87 × 11] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
## $ height : int [1:87] 172 167 96 202 150 178 165 97 183 182 ...
## $ mass : num [1:87] 77 75 32 136 49 120 75 32 84 77 ...
## $ hair_color: chr [1:87] "blond" NA NA "none" ...
## $ skin_color: chr [1:87] "fair" "gold" "white, blue" "white" ...
## $ eye_color : chr [1:87] "blue" "yellow" "red" "yellow" ...
## $ birth_year: num [1:87] 19 112 33 41.9 19 52 47 NA 24 57 ...
## $ sex : chr [1:87] "male" "none" "none" "male" ...
## $ gender : chr [1:87] "masculine" "masculine" "masculine" "masculine" ...
## $ homeworld : chr [1:87] "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
## $ species : chr [1:87] "Human" "Droid" "Droid" "Human" ...
To subset observations based on a condition use filter(). As with
select(), you can put the data frame inside the function or use %>%.
filter(starwars, species == "Human")
## # A tibble: 35 × 11
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 3 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 4 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## 5 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
## 6 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 7 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
## 8 Anakin Sky… 188 84 blond fair blue 41.9 male mascu… Tatooi…
## 9 Wilhuff Ta… 180 NA auburn… fair blue 64 male mascu… Eriadu
## 10 Han Solo 180 80 brown fair brown 29 male mascu… Corell…
## # … with 25 more rows, 1 more variable: species <chr>, and abbreviated variable
## # names ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
starwars %>%
  filter(species == "Human")
## # A tibble: 35 × 11
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 3 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 4 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## 5 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
## 6 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 7 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
## 8 Anakin Sky… 188 84 blond fair blue 41.9 male mascu… Tatooi…
## 9 Wilhuff Ta… 180 NA auburn… fair blue 64 male mascu… Eriadu
## 10 Han Solo 180 80 brown fair brown 29 male mascu… Corell…
## # … with 25 more rows, 1 more variable: species <chr>, and abbreviated variable
## # names ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
filter() uses conditional expressions to subset the data frame. A
conditional expression tests a value against a condition and returns
either TRUE or FALSE. For example, the conditional expression 1 == 2
can be read as “is the value 1 equal to the value 2?”. The result of
that test is FALSE. 1 <= 2 can be read as “is 1 less than or equal to
2?”, which should return TRUE.
1 == 2
## [1] FALSE
1 <= 2
## [1] TRUE
As we saw in an earlier lesson, R can apply this evaluation for an entire vector.
x <- c(1, 2, 4, 8, 16)
x < 8
## [1] TRUE TRUE TRUE FALSE FALSE
R evaluates each member of the vector and returns TRUE or FALSE for
each one.
filter() applies a conditional expression to a data frame column and
keeps those rows that evaluate to TRUE.
starwars %>%
  filter(mass > 100)
## # A tibble: 10 × 11
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 2 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## 3 Chewbacca 228 112 brown unknown blue 200 male mascu… Kashyy…
## 4 Jabba Desi… 175 1358 <NA> green-… orange 600 herm… mascu… Nal Hu…
## 5 Jek Tono P… 180 110 brown fair blue NA male mascu… Bestin…
## 6 IG-88 200 140 none metal red 15 none mascu… <NA>
## 7 Bossk 190 113 none green red 53 male mascu… Trando…
## 8 Dexter Jet… 198 102 none brown yellow NA male mascu… Ojom
## 9 Grievous 216 159 none brown,… green,… NA male mascu… Kalee
## 10 Tarfful 234 136 brown brown blue NA male mascu… Kashyy…
## # … with 1 more variable: species <chr>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
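One detail worth knowing is that a comparison with a missing value returns NA rather than TRUE or FALSE, and filter() drops those rows. The sketch below, with placeholder names heavy and light, shows the effect on the mass column.

```r
library(dplyr)

# Comparisons with a missing value return NA, not TRUE or FALSE
NA > 100

# Count how many characters have a missing mass
sum(is.na(starwars$mass))

# filter() drops rows where the condition is NA, so heavy and light
# together add up to fewer rows than the full data set
heavy <- starwars %>% filter(mass > 100)
light <- starwars %>% filter(mass <= 100)
nrow(heavy) + nrow(light)
nrow(starwars)

# Use is.na() to look at the rows that neither filter returns
starwars %>% filter(is.na(mass))
```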
filter() uses the following relational operators:
==: equal to
!=: not equal to
<: less than
>: greater than
<=: less than or equal to
>=: greater than or equal to
You can also use !, &, and | to combine conditions. Respectively, these
mean NOT, AND, and OR. filter(eye_color == 'yellow' & species == "Human")
returns all those individuals who have yellow eyes and are human.
starwars %>%
  filter(eye_color == 'yellow' & species == "Human")
## # A tibble: 2 × 11
## name height mass hair_c…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 2 Palpatine 170 75 grey pale yellow 82 male mascu… Naboo
## # … with 1 more variable: species <chr>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
filter(eye_color == 'yellow' & !species == "Human") means “individuals
where eye_color is equal to ‘yellow’ AND species is NOT equal to
‘Human’”.
starwars %>%
  filter(eye_color == 'yellow' & !species == "Human")
## # A tibble: 9 × 11
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 2 Watto 137 NA black blue, … yellow NA male mascu… Toydar…
## 3 Darth Maul 175 80 none red yellow 54 male mascu… Dathom…
## 4 Dud Bolt 94 45 none blue, … yellow NA male mascu… Vulpter
## 5 Ki-Adi-Mundi 198 82 white pale yellow 92 male mascu… Cerea
## 6 Yarael Poof 264 NA none white yellow NA male mascu… Quermia
## 7 Poggle the … 183 80 none green yellow NA male mascu… Geonos…
## 8 Zam Wesell 168 55 blonde fair, … yellow NA fema… femin… Zolan
## 9 Dexter Jett… 198 102 none brown yellow NA male mascu… Ojom
## # … with 1 more variable: species <chr>, and abbreviated variable names
## # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
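The same NOT condition can be written in a few ways. In this sketch, the variable names by_not, by_neq, and by_comma are placeholders; the comparison at the end confirms that all three return the same characters.

```r
library(dplyr)

# ! applied to the whole comparison, as in the example above
by_not <- starwars %>%
  filter(eye_color == "yellow" & !species == "Human")

# != states "not equal to" directly and is often easier to read
by_neq <- starwars %>%
  filter(eye_color == "yellow" & species != "Human")

# Conditions separated by a comma are combined with AND
by_comma <- starwars %>%
  filter(eye_color == "yellow", species != "Human")

# All three versions keep the same rows
identical(by_not$name, by_neq$name)
identical(by_neq$name, by_comma$name)
```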
filter(eye_color == 'yellow' | species == "Human") returns those
individuals for which eye color is equal to yellow, independent of
species, OR those who are human, independent of eye color.
starwars %>%
  filter(eye_color == 'yellow' | species == "Human")
## # A tibble: 44 × 11
## name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 Luke Skywa… 172 77 blond fair blue 19 male mascu… Tatooi…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
## 3 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
## 4 Leia Organa 150 49 brown light brown 19 fema… femin… Aldera…
## 5 Owen Lars 178 120 brown,… light blue 52 male mascu… Tatooi…
## 6 Beru White… 165 75 brown light blue 47 fema… femin… Tatooi…
## 7 Biggs Dark… 183 84 black light brown 24 male mascu… Tatooi…
## 8 Obi-Wan Ke… 182 77 auburn… fair blue-g… 57 male mascu… Stewjon
## 9 Anakin Sky… 188 84 blond fair blue 41.9 male mascu… Tatooi…
## 10 Wilhuff Ta… 180 NA auburn… fair blue 64 male mascu… Eriadu
## # … with 34 more rows, 1 more variable: species <chr>, and abbreviated variable
## # names ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
Combining summarise() with some of R’s base functions allows you to easily get descriptive statistics on your data.
starwars %>%
  filter(birth_year > 20) %>%
  summarise(mean_height = mean(height))
## # A tibble: 1 × 1
## mean_height
## <dbl>
## 1 175.
starwars %>%
  select(species) %>%
  filter(species != "Human") %>%
  summarise(count_nonhuman = n())
## # A tibble: 1 × 1
## count_nonhuman
## <int>
## 1 48
starwars %>%
  summarise(mn_height = mean(height, na.rm = TRUE),
            sd_height = sd(height, na.rm = TRUE),
            min_height = min(height, na.rm = TRUE),
            max_height = max(height, na.rm = TRUE))
## # A tibble: 1 × 4
## mn_height sd_height min_height max_height
## <dbl> <dbl> <int> <int>
## 1 174. 34.8 66 264
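The na.rm argument used above matters because a single missing value makes most summary functions return NA. A small sketch with a made-up heights vector, followed by the same idea inside summarise(); the column names in the summary are placeholders.

```r
library(dplyr)

# A single NA makes the mean of the whole vector NA
heights <- c(172, 167, NA, 202)
mean(heights)

# na.rm = TRUE drops the missing values before calculating
mean(heights, na.rm = TRUE)

# The same applies inside summarise(); n() counts all rows,
# while sum(!is.na(height)) counts only rows with a recorded height
height_summary <- starwars %>%
  summarise(
    mean_raw      = mean(height),
    mean_clean    = mean(height, na.rm = TRUE),
    n_rows        = n(),
    n_with_height = sum(!is.na(height))
  )
height_summary
```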
You may have noticed that summarise() produces a data frame as its
output. If you only need one value, this may not be all that useful to
you. However, your data is often more complex, and you would like to
know if there are differences between groups. Adding group_by() before
summarise() calculates the summary separately for each group.
starwars %>%
  group_by(species) %>%
  summarise(
    count = n(),
    mn_mass = mean(mass, na.rm = TRUE))
## # A tibble: 38 × 3
## species count mn_mass
## <chr> <int> <dbl>
## 1 Aleena 1 15
## 2 Besalisk 1 102
## 3 Cerean 1 82
## 4 Chagrian 1 NaN
## 5 Clawdite 1 55
## 6 Droid 6 69.8
## 7 Dug 1 40
## 8 Ewok 1 20
## 9 Geonosian 1 80
## 10 Gungan 3 74
## # … with 28 more rows
The output of dplyr functions is a data frame, so you can create a
summary data frame and filter it all within the same chain. Here we’ll
summarize by species, then filter the summary output by the count.
starwars %>%
  group_by(species) %>%
  summarise(
    count = n(),
    mn_mass = mean(mass, na.rm = TRUE)) %>%
  filter(count > 5)
## # A tibble: 2 × 3
## species count mn_mass
## <chr> <int> <dbl>
## 1 Droid 6 69.8
## 2 Human 35 82.8
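The same group, summarise, then filter pattern works on any data frame. As a sketch, here it is applied to the iris data set from earlier; the summary column names and the cutoff of 2 are arbitrary choices for illustration.

```r
library(dplyr)

# Summarise petal length per species, then keep only the species
# whose mean petal length is above 2
petal_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    count = n(),
    mn_petal_length = mean(Petal.Length),
    sd_petal_length = sd(Petal.Length)
  ) %>%
  filter(mn_petal_length > 2)

petal_summary
```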
Often you will need to modify a variable for some reason. Maybe you
need to scale the data or log transform it. dplyr makes this pretty
easy. mutate() adds a new column to the end of the data frame and won’t
overwrite the original data.
starwars %>%
  select(name:mass) %>%
  mutate(height_meter = height / 100)
## # A tibble: 87 × 4
## name height mass height_meter
## <chr> <int> <dbl> <dbl>
## 1 Luke Skywalker 172 77 1.72
## 2 C-3PO 167 75 1.67
## 3 R2-D2 96 32 0.96
## 4 Darth Vader 202 136 2.02
## 5 Leia Organa 150 49 1.5
## 6 Owen Lars 178 120 1.78
## 7 Beru Whitesun lars 165 75 1.65
## 8 R5-D4 97 32 0.97
## 9 Biggs Darklighter 183 84 1.83
## 10 Obi-Wan Kenobi 182 77 1.82
## # … with 77 more rows
As with all of these verbs, remember that if you don’t assign the output to a variable, whatever you do won’t be stored.
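A sketch of the difference between printing and storing, using the log transform and scaling mentioned earlier. The names starwars_scaled, log_mass, and mass_scaled are placeholders chosen for this example.

```r
library(dplyr)

# Without assignment, the result is printed and then discarded
starwars %>%
  mutate(log_mass = log(mass))
"log_mass" %in% names(starwars)

# Assign the output to keep the new columns
starwars_scaled <- starwars %>%
  select(name:mass) %>%
  mutate(
    height_meter = height / 100,
    log_mass     = log(mass),
    mass_scaled  = (mass - mean(mass, na.rm = TRUE)) / sd(mass, na.rm = TRUE)
  )

names(starwars_scaled)
```

Several new columns can be created in one mutate() call by separating them with commas.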
There is a lot more you can do with dplyr, so I would recommend
checking out the documentation. The book R for Data Science by Garrett
Grolemund and Hadley Wickham has a chapter with a lot of useful
information.
