Tidy data

In the last lesson, we covered how to use read.table() and read.csv() to import your data into R. Even once your data is in R, you are likely to run into problems. Many of them can be solved by making sure that your variables are in the correct format, but plenty of other issues can arise that may not be obvious at first glance. You can save yourself a lot of headaches by formatting your data properly from the outset. Follow these rules to make your life easier:

Each variable has its own column
- The first row contains variable names
- Every variable has a name
Each observation has its own row
Each value has its own cell
- Don’t leave blank cells
- In R, NA is the default value for missing data
Be consistent
- Values that are the same should be entered the same way (e.g., don’t mix Y and Yes, or Austria and austria)
- Be consistent with variable names (e.g., don’t mix styles such as petal.width, petal length, sepal_length, Sepal_Width)
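As a rough illustration of these rules, the sketch below cleans up a small, made-up data frame in base R. The data frame raw and its columns Country and Smoker are hypothetical and exist only for this example.

```r
# Hypothetical untidy data: mixed-case names, inconsistent values, a blank cell.
raw <- data.frame(
  Country = c("Austria", "austria", "Germany"),
  Smoker  = c("Y", "Yes", ""),
  stringsAsFactors = FALSE
)

# Use one naming style for every variable (here: all lower case).
names(raw) <- tolower(names(raw))

# Enter values that mean the same thing in the same way.
raw$country[raw$country == "austria"] <- "Austria"
raw$smoker[raw$smoker == "Y"] <- "Yes"

# Replace blank cells with NA, R's marker for missing data.
raw$smoker[raw$smoker == ""] <- NA

raw
```

Fixing these problems before importing the data is usually easier, but the same steps work after import.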

Data sets in R

When you install R, you also install a package that contains multiple data sets. These data sets can be used for practice, and you will often see tutorials use one or more of them. In RStudio, calling the function data() with no arguments opens a new tab in your Source pane.

data()

At the top of the tab, you’ll see “Data sets in package ‘datasets’:”. The datasets package is a “base” R package. Base R packages come with the installation of R and are often essential to the basic functioning of R as a language. Take a look at the available data sets and their descriptions. We’ll load one now.

Loading a preexisting data set

There are a couple of data sets that you will see very often in tutorials. Among the most popular are mtcars, iris, ToothGrowth, and USArrests. To find out more information on a data set, use the ? function with the name of the data set in the console (e.g., ?mtcars). This will bring up the documentation which includes a description of the data.

To load one of these data sets, use the same data() function and put the name of the data set as the first argument. The name can be given with or without quotes: data('iris') and data(iris) both work.

data('iris')
## You may see <promise> in your R environment window. This is normal: R loads the data the first time you run a line of code that uses it.

Whenever you load data, whether it is one of R’s pre-installed data sets or your own, you should check that what is in your environment is what you expected. The functions str() and head() are useful for checking.

str() prints the structure of the data. Here you can see the number and names of columns (aka variables), the number of observations, and the data type of each column.

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

In this data set, we have 150 observations taken from three species of flowers. For each individual flower there are four measurements.
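A few other base R functions can confirm the same facts one at a time. This is only a sketch of checks you might run; which ones are worth running depends on your data.

```r
# Number of rows (observations) and columns (variables).
iris_dim <- dim(iris)
iris_dim

# Column names, useful for spotting inconsistent naming.
names(iris)

# The categories stored in the Species factor.
species_levels <- levels(iris$Species)
species_levels

# How many observations there are per species.
table(iris$Species)
```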

head() prints the first 6 rows (observations) of the data frame by default. You can change the number of rows shown with the n = argument.

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
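For example, the n = argument could be used as follows. tail() is the base R counterpart of head() and shows the last rows instead; it is not covered elsewhere in this lesson.

```r
# Show only the first 3 rows instead of the default 6.
first_three <- head(iris, n = 3)
first_three

# tail() works the same way but starts from the end of the data frame.
last_two <- tail(iris, n = 2)
last_two
```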

The rm() function can help you clean up your environment pane. rm(variable_name) removes the variable from your global environment, and it works the same way for anything listed under Data, Values, or Functions. To remove everything from your environment, use rm(list = ls()). The ls() function returns the names of all your current variables, so you’re giving rm() the full list of names to remove. Let’s remove the iris data set from our (current) environment.

rm(iris)

You shouldn’t see iris under “Data” in your global environment pane anymore. If you accidentally removed the data set, you can still reload iris using data(), but any changes you made to the data set will be lost. Similarly, if you create a variable and then remove it with rm(), the variable is gone, which is why it is important to work in scripts and save often.
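The short sketch below shows both points: a change made to your copy of iris disappears after rm() and a reload, and a variable you created yourself cannot be recovered. The value 999 and the name scratch_value are arbitrary, chosen only for this illustration.

```r
# Load iris and change one value in our copy.
data(iris)
iris$Sepal.Length[1] <- 999

# Remove the modified copy, then reload the original data set.
rm(iris)
data(iris)
restored_value <- iris$Sepal.Length[1]
restored_value   # the original value again; the edit is lost

# A variable we created ourselves is simply gone after rm().
scratch_value <- 42
rm(scratch_value)
exists("scratch_value")
```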

`dplyr`

Now that we have our data in R, let’s see what we can find out! For this, we’ll use the package dplyr.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## dplyr uses objects called "tibbles". These are essentially dataframes. 
#?tibble()

Many, if not most, R packages include data sets that can be used for practicing. Run the data() function again.

data()

You’ll see the same information as before, but if you scroll down, you’ll see “Data sets in package ‘dplyr’”. When you load a package, you also load the package’s data sets. Let’s load the starwars data set.

data("starwars")

head(starwars)

## # A tibble: 6 × 14
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Luke Skywal…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
## 2 C-3PO           167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
## 3 R2-D2            96    32 <NA>    white,… red        33   none  mascu… Naboo  
## 4 Darth Vader     202   136 none    white   yellow     41.9 male  mascu… Tatooi…
## 5 Leia Organa     150    49 brown   light   brown      19   fema… femin… Aldera…
## 6 Owen Lars       178   120 brown,… light   blue       52   male  mascu… Tatooi…
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>, and abbreviated variable names ¹hair_color, ²skin_color,
## #   ³eye_color, ⁴birth_year, ⁵homeworld

str(starwars[1:11]) # we'll print only the first 11 columns. The last three columns are lists.

## tibble [87 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name      : chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
##  $ height    : int [1:87] 172 167 96 202 150 178 165 97 183 182 ...
##  $ mass      : num [1:87] 77 75 32 136 49 120 75 32 84 77 ...
##  $ hair_color: chr [1:87] "blond" NA NA "none" ...
##  $ skin_color: chr [1:87] "fair" "gold" "white, blue" "white" ...
##  $ eye_color : chr [1:87] "blue" "yellow" "red" "yellow" ...
##  $ birth_year: num [1:87] 19 112 33 41.9 19 52 47 NA 24 57 ...
##  $ sex       : chr [1:87] "male" "none" "none" "male" ...
##  $ gender    : chr [1:87] "masculine" "masculine" "masculine" "masculine" ...
##  $ homeworld : chr [1:87] "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
##  $ species   : chr [1:87] "Human" "Droid" "Droid" "Human" ...

At first, dplyr may seem a little strange. As you get familiar with it, however, you will notice that it’s pretty intuitive and makes data organization a lot easier. dplyr allows you to filter, rearrange, modify, and summarize your data quickly and (relatively) painlessly. We will learn the following verbs:

select()
- allows you to subset columns (variables)
filter()
- allows you to subset rows (observations)
group_by()
- allows you to group observations
summarise()
- allows you to summarize your data
mutate()
- allows you to modify your data

The great thing about these verbs is that they can be combined with the operator %>% so that you can perform multiple operations at once.
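To see what the pipe does, it may help to try it on a plain vector first. The pipe passes whatever is on its left as the first argument of the function on its right, so the two forms below should give the same result. The vector values is made up for this sketch.

```r
library(dplyr)  # dplyr makes the %>% operator available

values <- c(1, 4, 9)

# Nested call: read from the inside out.
nested_result <- sum(sqrt(values))

# Piped call: read from left to right, one step at a time.
piped_result <- values %>%
    sqrt() %>%
    sum()

identical(nested_result, piped_result)
```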

`select()`

select() allows you to choose one or more columns. The first argument (if you don’t use %>%) should be the data frame, and the second should be the column name. You don’t need to put quotes ("") around the column name in dplyr.

select(starwars, name)

## # A tibble: 87 × 1
##    name              
##    <chr>             
##  1 Luke Skywalker    
##  2 C-3PO             
##  3 R2-D2             
##  4 Darth Vader       
##  5 Leia Organa       
##  6 Owen Lars         
##  7 Beru Whitesun lars
##  8 R5-D4             
##  9 Biggs Darklighter 
## 10 Obi-Wan Kenobi    
## # … with 77 more rows

You can include as many column names as arguments to select multiple columns (and you don’t need to use c()).

select(starwars, name, species)

## # A tibble: 87 × 2
##    name               species
##    <chr>              <chr>  
##  1 Luke Skywalker     Human  
##  2 C-3PO              Droid  
##  3 R2-D2              Droid  
##  4 Darth Vader        Human  
##  5 Leia Organa        Human  
##  6 Owen Lars          Human  
##  7 Beru Whitesun lars Human  
##  8 R5-D4              Droid  
##  9 Biggs Darklighter  Human  
## 10 Obi-Wan Kenobi     Human  
## # … with 77 more rows

We can also write the code using the pipe operator %>%. For simple examples you don’t need it, but as your code gets more complex the pipe will make it easier to understand. If you use %>%, the first argument that you explicitly write is the first column name you want to select.

starwars %>% 
    select(name, species)

## # A tibble: 87 × 2
##    name               species
##    <chr>              <chr>  
##  1 Luke Skywalker     Human  
##  2 C-3PO              Droid  
##  3 R2-D2              Droid  
##  4 Darth Vader        Human  
##  5 Leia Organa        Human  
##  6 Owen Lars          Human  
##  7 Beru Whitesun lars Human  
##  8 R5-D4              Droid  
##  9 Biggs Darklighter  Human  
## 10 Obi-Wan Kenobi     Human  
## # … with 77 more rows

The columns in the output appear in the order you list them in select(), not in their original order in the data frame.

starwars %>% 
    select(species, name)

## # A tibble: 87 × 2
##    species name              
##    <chr>   <chr>             
##  1 Human   Luke Skywalker    
##  2 Droid   C-3PO             
##  3 Droid   R2-D2             
##  4 Human   Darth Vader       
##  5 Human   Leia Organa       
##  6 Human   Owen Lars         
##  7 Human   Beru Whitesun lars
##  8 Droid   R5-D4             
##  9 Human   Biggs Darklighter 
## 10 Human   Obi-Wan Kenobi    
## # … with 77 more rows

You can get fancy by using other functions within select() like starts_with() or contains().

starwars %>% 
    select(contains("color"))

## # A tibble: 87 × 3
##    hair_color    skin_color  eye_color
##    <chr>         <chr>       <chr>    
##  1 blond         fair        blue     
##  2 <NA>          gold        yellow   
##  3 <NA>          white, blue red      
##  4 none          white       yellow   
##  5 brown         light       brown    
##  6 brown, grey   light       blue     
##  7 brown         light       blue     
##  8 <NA>          white, red  red      
##  9 black         light       brown    
## 10 auburn, white fair        blue-gray
## # … with 77 more rows
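The other selection helpers work the same way. A brief sketch with starts_with() and ends_with(), both of which are available inside select():

```r
library(dplyr)

# Columns whose names start with "h".
h_cols <- starwars %>%
    select(starts_with("h"))
names(h_cols)

# Helpers can be mixed with ordinary column names.
name_and_colors <- starwars %>%
    select(name, ends_with("color"))
names(name_and_colors)
```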

Or select columns by their type using select_if():

starwars %>% 
    select_if(is.numeric)

## # A tibble: 87 × 3
##    height  mass birth_year
##     <int> <dbl>      <dbl>
##  1    172    77       19  
##  2    167    75      112  
##  3     96    32       33  
##  4    202   136       41.9
##  5    150    49       19  
##  6    178   120       52  
##  7    165    75       47  
##  8     97    32       NA  
##  9    183    84       24  
## 10    182    77       57  
## # … with 77 more rows
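In dplyr 1.0.0 and later, select_if() is marked as superseded, and the documentation recommends combining select() with the where() helper instead. Assuming a dplyr version of at least 1.0.0 (such as the one listed in the session information), the equivalent call would look like this:

```r
library(dplyr)

# where() applies a function to each column and keeps those that return TRUE.
numeric_cols <- starwars %>%
    select(where(is.numeric))

names(numeric_cols)
```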

To drop a column use - and to select a range of columns use :.

starwars %>% 
    select(-(hair_color:starships))

## # A tibble: 87 × 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows
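The two operators can also be used on their own. As a small sketch, a range without the minus sign keeps those columns, and several single columns can be dropped in one call:

```r
library(dplyr)

# Keep every column from name through mass.
first_cols <- starwars %>%
    select(name:mass)
names(first_cols)

# Drop individual columns by prefixing each with a minus sign.
no_name_height <- first_cols %>%
    select(-name, -height)
names(no_name_height)
```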

To store the result of the selection, assign the output to a new variable, or to the same variable if you want to overwrite it. We’ll drop the three final columns and store the result under the same variable name.

starwars <- 
    starwars %>% 
    select(-(films:starships))

str(starwars)

## tibble [87 × 11] (S3: tbl_df/tbl/data.frame)
##  $ name      : chr [1:87] "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
##  $ height    : int [1:87] 172 167 96 202 150 178 165 97 183 182 ...
##  $ mass      : num [1:87] 77 75 32 136 49 120 75 32 84 77 ...
##  $ hair_color: chr [1:87] "blond" NA NA "none" ...
##  $ skin_color: chr [1:87] "fair" "gold" "white, blue" "white" ...
##  $ eye_color : chr [1:87] "blue" "yellow" "red" "yellow" ...
##  $ birth_year: num [1:87] 19 112 33 41.9 19 52 47 NA 24 57 ...
##  $ sex       : chr [1:87] "male" "none" "none" "male" ...
##  $ gender    : chr [1:87] "masculine" "masculine" "masculine" "masculine" ...
##  $ homeworld : chr [1:87] "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
##  $ species   : chr [1:87] "Human" "Droid" "Droid" "Human" ...

`filter()`

To subset observations based on a condition use filter(). As with select() you can put the data frame inside the function or use %>%.

filter(starwars, species == "Human")

## # A tibble: 35 × 11
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Wilhuff Ta…    180    NA auburn… fair    blue       64   male  mascu… Eriadu 
## 10 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 25 more rows, 1 more variable: species <chr>, and abbreviated variable
## #   names ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld

starwars %>%
        filter(species == "Human")

## # A tibble: 35 × 11
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  3 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  4 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  5 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  6 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  7 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  8 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
##  9 Wilhuff Ta…    180    NA auburn… fair    blue       64   male  mascu… Eriadu 
## 10 Han Solo       180    80 brown   fair    brown      29   male  mascu… Corell…
## # … with 25 more rows, 1 more variable: species <chr>, and abbreviated variable
## #   names ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld

Conditional expressions and Relational operators

filter() uses conditional expressions to subset the data frame. A conditional expression tests a value against a condition and returns either TRUE or FALSE. For example, the conditional expression 1 == 2 can be read as “is the value 1 equal to the value 2?”. The result of that test is FALSE. 1 <= 2 can be read “is 1 less than or equal to 2”, which should return TRUE.

1==2

## [1] FALSE

1 <= 2

## [1] TRUE

As we saw in an earlier lesson, R can apply this evaluation for an entire vector.

x <- c(1, 2, 4, 8, 16)
x < 8

## [1]  TRUE  TRUE  TRUE FALSE FALSE

R evaluates each member of the vector and returns TRUE or FALSE.
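The resulting logical vector can be used to pick out the matching values, which is essentially what filter() does with the rows of a data frame. A short base R sketch using the same vector:

```r
x <- c(1, 2, 4, 8, 16)

# One TRUE or FALSE per element of x.
is_small <- x < 8

# Keep only the elements where the test returned TRUE.
small_values <- x[is_small]
small_values

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches.
n_small <- sum(is_small)
n_small
```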

filter() applies a conditional expression to a data frame column and keeps those rows that evaluate to TRUE.

starwars %>%
        filter(mass > 100)

## # A tibble: 10 × 11
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  2 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  3 Chewbacca      228   112 brown   unknown blue      200   male  mascu… Kashyy…
##  4 Jabba Desi…    175  1358 <NA>    green-… orange    600   herm… mascu… Nal Hu…
##  5 Jek Tono P…    180   110 brown   fair    blue       NA   male  mascu… Bestin…
##  6 IG-88          200   140 none    metal   red        15   none  mascu… <NA>   
##  7 Bossk          190   113 none    green   red        53   male  mascu… Trando…
##  8 Dexter Jet…    198   102 none    brown   yellow     NA   male  mascu… Ojom   
##  9 Grievous       216   159 none    brown,… green,…    NA   male  mascu… Kalee  
## 10 Tarfful        234   136 brown   brown   blue       NA   male  mascu… Kashyy…
## # … with 1 more variable: species <chr>, and abbreviated variable names
## #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld

filter() uses the following relational operators:

== : equal to
!= : not equal to
< : less than
> : greater than
<= : less than or equal to
>= : greater than or equal to
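To try a couple of these operators on something small enough to check by eye, here is a sketch with a made-up data frame. The data frame pets and its columns are hypothetical and not part of any package.

```r
library(dplyr)

# Hypothetical data, invented for this example.
pets <- data.frame(
    name      = c("Rex", "Tom", "Nemo", "Polly"),
    species   = c("dog", "cat", "fish", "bird"),
    weight_kg = c(30, 4, 0.1, 0.4)
)

# != keeps every row where species is NOT "fish".
not_fish <- pets %>%
    filter(species != "fish")
not_fish

# >= includes the boundary value, so Tom (exactly 4 kg) is kept.
at_least_4kg <- pets %>%
    filter(weight_kg >= 4)
at_least_4kg
```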

You can also use !, &, and | to combine conditions. Respectively, these mean NOT, AND, and OR. filter(eye_color == 'yellow' & species == "Human") returns all individuals who have yellow eyes and are human.

starwars %>% 
    filter(eye_color == 'yellow' & species == "Human")

## # A tibble: 2 × 11
##   name        height  mass hair_c…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>        <int> <dbl> <chr>    <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 Darth Vader    202   136 none     white   yellow     41.9 male  mascu… Tatooi…
## 2 Palpatine      170    75 grey     pale    yellow     82   male  mascu… Naboo  
## # … with 1 more variable: species <chr>, and abbreviated variable names
## #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld

filter(eye_color == 'yellow' & !species == "Human") means “individuals where eye_color is equal to ‘yellow’ AND species is NOT equal to ‘Human’”.

starwars %>% 
    filter(eye_color == 'yellow' & !species == "Human")

## # A tibble: 9 × 11
##   name         height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##   <chr>         <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
## 1 C-3PO           167    75 <NA>    gold    yellow      112 none  mascu… Tatooi…
## 2 Watto           137    NA black   blue, … yellow       NA male  mascu… Toydar…
## 3 Darth Maul      175    80 none    red     yellow       54 male  mascu… Dathom…
## 4 Dud Bolt         94    45 none    blue, … yellow       NA male  mascu… Vulpter
## 5 Ki-Adi-Mundi    198    82 white   pale    yellow       92 male  mascu… Cerea  
## 6 Yarael Poof     264    NA none    white   yellow       NA male  mascu… Quermia
## 7 Poggle the …    183    80 none    green   yellow       NA male  mascu… Geonos…
## 8 Zam Wesell      168    55 blonde  fair, … yellow       NA fema… femin… Zolan  
## 9 Dexter Jett…    198   102 none    brown   yellow       NA male  mascu… Ojom   
## # … with 1 more variable: species <chr>, and abbreviated variable names
## #   ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
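In R, ! is applied after ==, so !species == "Human" is read as !(species == "Human"). Writing the condition with != should therefore give the same rows and may be easier to read. The sketch below compares the two forms. Note also that filter() drops rows where the condition evaluates to NA, so individuals with a missing species do not appear in either result.

```r
library(dplyr)

# NOT applied to the whole comparison.
with_not <- starwars %>%
    filter(eye_color == "yellow" & !species == "Human")

# The same condition written with the "not equal to" operator.
with_neq <- starwars %>%
    filter(eye_color == "yellow" & species != "Human")

# Both versions should return the same individuals.
same_rows <- identical(with_not$name, with_neq$name)
same_rows
```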

filter(eye_color == 'yellow' | species == "Human") returns those individuals for which eye color is equal to yellow, independent of species, OR those who are humans, independent of eye color.

starwars %>% 
    filter(eye_color == 'yellow' | species == "Human")

## # A tibble: 44 × 11
##    name        height  mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex   gender homew…⁵
##    <chr>        <int> <dbl> <chr>   <chr>   <chr>     <dbl> <chr> <chr>  <chr>  
##  1 Luke Skywa…    172    77 blond   fair    blue       19   male  mascu… Tatooi…
##  2 C-3PO          167    75 <NA>    gold    yellow    112   none  mascu… Tatooi…
##  3 Darth Vader    202   136 none    white   yellow     41.9 male  mascu… Tatooi…
##  4 Leia Organa    150    49 brown   light   brown      19   fema… femin… Aldera…
##  5 Owen Lars      178   120 brown,… light   blue       52   male  mascu… Tatooi…
##  6 Beru White…    165    75 brown   light   blue       47   fema… femin… Tatooi…
##  7 Biggs Dark…    183    84 black   light   brown      24   male  mascu… Tatooi…
##  8 Obi-Wan Ke…    182    77 auburn… fair    blue-g…    57   male  mascu… Stewjon
##  9 Anakin Sky…    188    84 blond   fair    blue       41.9 male  mascu… Tatooi…
## 10 Wilhuff Ta…    180    NA auburn… fair    blue       64   male  mascu… Eriadu 
## # … with 34 more rows, 1 more variable: species <chr>, and abbreviated variable
## #   names ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld

`summarise()`

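summarise() collapses a data frame into a single row of summary values. Combining this verb with some of R’s base functions allows you to easily get descriptive statistics on your data. Inside summarise(), the dplyr function n() counts the number of rows.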

starwars %>% 
    filter(birth_year > 20) %>% 
    summarise(mean_height = mean(height))

## # A tibble: 1 × 1
##   mean_height
##         <dbl>
## 1        175.

starwars %>% 
    select(species) %>% 
    filter(species != "Human") %>% 
    summarise(count_nonhuman = n())

## # A tibble: 1 × 1
##   count_nonhuman
##            <int>
## 1             48

starwars %>%
    summarise(mn_height = mean(height, na.rm = TRUE),
              sd_height = sd(height, na.rm = TRUE),
              min_height = min(height, na.rm = TRUE),
              max_height = max(height, na.rm = TRUE))

## # A tibble: 1 × 4
##   mn_height sd_height min_height max_height
##       <dbl>     <dbl>      <int>      <int>
## 1      174.      34.8         66        264
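The na.rm argument in the last example matters because most of R's summary functions return NA as soon as a single value is missing. A minimal base R sketch with a made-up vector of heights:

```r
# Hypothetical heights with one missing value.
heights <- c(172, 167, NA, 202)

# Without na.rm, the missing value makes the whole result NA.
mean_with_na <- mean(heights)
mean_with_na

# With na.rm = TRUE, the NA is removed before the mean is calculated.
mean_without_na <- mean(heights, na.rm = TRUE)
mean_without_na
```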

`group_by()`

You may have noticed that summarise() produces a data frame as its output. If you only need one value, this may not be all that useful to you. However, your data is often more complex, and you would like to know if there are differences between groups. If you call group_by() first, summarise() calculates the summary separately for each group and returns one row per group.

starwars %>% 
    group_by(species) %>% 
    summarise(
        count = n(),
        mn_mass = mean(mass, na.rm = TRUE))

## # A tibble: 38 × 3
##    species   count mn_mass
##    <chr>     <int>   <dbl>
##  1 Aleena        1    15  
##  2 Besalisk      1   102  
##  3 Cerean        1    82  
##  4 Chagrian      1   NaN  
##  5 Clawdite      1    55  
##  6 Droid         6    69.8
##  7 Dug           1    40  
##  8 Ewok          1    20  
##  9 Geonosian     1    80  
## 10 Gungan        3    74  
## # … with 28 more rows
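With 38 species the result is hard to verify by hand, so here is the same pattern on a tiny, made-up data frame where the group means are easy to check. The data frame scores and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data: two groups with 2 and 3 observations.
scores <- data.frame(
    group = c("a", "a", "b", "b", "b"),
    score = c(1, 3, 10, 20, 30)
)

# One output row per group: its size and its mean score.
by_group <- scores %>%
    group_by(group) %>%
    summarise(
        count = n(),
        mn_score = mean(score))

by_group
```

Group a contains the scores 1 and 3 and group b contains 10, 20, and 30, so the means should come out as 2 and 20.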

The output of a dplyr function is a data frame, so you can create a data frame and filter it within the same chain. Here we’ll summarize by species and then filter the summary output by the count column.

starwars %>% 
    group_by(species) %>% 
    summarise(
        count = n(),
        mn_mass = mean(mass, na.rm = TRUE)) %>% 
    filter(count > 5)

## # A tibble: 2 × 3
##   species count mn_mass
##   <chr>   <int>   <dbl>
## 1 Droid       6    69.8
## 2 Human      35    82.8

`mutate()`

Often you will need to modify a variable for some reason. Maybe you need to scale the data or log transform it. dplyr makes this pretty easy. mutate() adds a new column to the end of the data frame and doesn’t overwrite the original data.

starwars %>% 
    select(name:mass) %>% 
    mutate(height_meter = height/100)

## # A tibble: 87 × 4
##    name               height  mass height_meter
##    <chr>               <int> <dbl>        <dbl>
##  1 Luke Skywalker        172    77         1.72
##  2 C-3PO                 167    75         1.67
##  3 R2-D2                  96    32         0.96
##  4 Darth Vader           202   136         2.02
##  5 Leia Organa           150    49         1.5 
##  6 Owen Lars             178   120         1.78
##  7 Beru Whitesun lars    165    75         1.65
##  8 R5-D4                  97    32         0.97
##  9 Biggs Darklighter     183    84         1.83
## 10 Obi-Wan Kenobi        182    77         1.82
## # … with 77 more rows
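mutate() can create several columns in one call, and later columns can use ones defined earlier in the same call. The sketch below adds a log-transformed mass and a body mass index; the column names height_m, log_mass, and bmi are chosen just for this example.

```r
library(dplyr)

sw_extra <- starwars %>%
    select(name:mass) %>%
    mutate(
        height_m = height / 100,     # centimetres to metres
        log_mass = log(mass),        # natural log transform
        bmi = mass / height_m^2)     # uses height_m created just above

head(sw_extra, n = 3)
```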

As with all of these, remember that if you don’t assign the output to a variable whatever you do won’t be stored.
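A quick way to convince yourself of this is to check the column names before and after assigning. The variable name sw_small is made up for this sketch.

```r
library(dplyr)

sw_small <- starwars %>%
    select(name, height)

# The result is printed but not stored, so sw_small is unchanged.
sw_small %>%
    mutate(height_meter = height / 100)
stored_before <- "height_meter" %in% names(sw_small)
stored_before

# Assigning the output keeps the new column.
sw_small <- sw_small %>%
    mutate(height_meter = height / 100)
stored_after <- "height_meter" %in% names(sw_small)
stored_after
```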

There is a lot more you can do with dplyr, so I would recommend checking out the documentation. The book R for Data Science by Garrett Grolemund and Hadley Wickham has a chapter with a lot of useful information.

================================================================================

Session information:

Last update on 2020-11-05

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_AT.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.0.10
## 
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.14  knitr_1.40       magrittr_2.0.3   tidyselect_1.1.2
##  [5] R6_2.5.1         rlang_1.0.5      fastmap_1.1.0    fansi_1.0.3     
##  [9] stringr_1.4.1    tools_4.2.1      xfun_0.32        utf8_1.2.2      
## [13] DBI_1.1.2        cli_3.3.0        jquerylib_0.1.4  ellipsis_0.3.2  
## [17] htmltools_0.5.3  assertthat_0.2.1 yaml_2.3.5       digest_0.6.29   
## [21] tibble_3.1.8     lifecycle_1.0.1  purrr_0.3.4      sass_0.4.2      
## [25] vctrs_0.4.1      glue_1.6.2       cachem_1.0.6     evaluate_0.16   
## [29] rmarkdown_2.16   stringi_1.7.8    compiler_4.2.1   bslib_0.4.0     
## [33] pillar_1.8.1     generics_0.1.3   jsonlite_1.8.0   pkgconfig_2.0.3
