Introduction

Reading your data into any program seems like it should be simple and straightforward. However, this is often not the case.

Luckily there are functions that can do the job for us!

There are numerous functions that you can use to read data into R. The function that you will use most will likely depend on what type of files you have your data stored in. For example, if you import from Excel, SPSS, Stata, or similar files, you will likely need to install a package that is built to handle the particular format.

We will only cover importing from .csv and .txt files. The most common and straightforward functions for this are read.csv(), read.csv2(), and read.table().

More information! read.table() is often used for .txt files while read.csv() and read.csv2() are used with .csv files. Technically, this is not required. Both read.csv() and read.csv2() are “wrappers” around the read.table() function. That is, they call read.table() but set different defaults for some of the arguments. The most important of these is sep, which is short for “separator”. This is the character that the function looks for to divide the values. The default in read.table() is white space: spaces, tabs, and new lines are all treated as dividing values. The default sep value in read.csv() is a comma, “,”. This is why read.csv() is useful for reading in CSV files; CSV is short for “comma-separated values”. read.csv2() has a default sep of “;”, which is useful in countries where a comma “,” is used as the decimal point and the semicolon “;” separates values.
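As a small illustration of the sep defaults described above, the sketch below reads the same values from three tiny made-up text snippets. The snippets and object names are invented for this example, and the text = argument is used only so that no files have to be created.

```r
# Three tiny made-up "files" holding the same values with different separators.
comma_text     <- "x,y\n1.5,2\n3.5,4"   # comma separator, "." as decimal point
semicolon_text <- "x;y\n1,5;2\n3,5;4"   # semicolon separator, "," as decimal point
space_text     <- "x y\n1.5 2\n3.5 4"   # white space separator

# text = reads from a character string instead of a file on disk.
from_csv   <- read.csv(text = comma_text)
from_csv2  <- read.csv2(text = semicolon_text)
from_table <- read.table(text = space_text, header = TRUE)

# read.table() can read the comma version too, once sep and header are set.
from_table_comma <- read.table(text = comma_text, header = TRUE, sep = ",")

# Compare the results; each call reports whether the two data frames match.
all.equal(from_csv, from_csv2)
all.equal(from_csv, from_table)
all.equal(from_csv, from_table_comma)
```

If read.csv() were used on the semicolon version instead, every line would end up in a single column, which is usually the first sign that the wrong separator was assumed.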

For those of you with German language keyboards, pay attention to the sep argument and if you have trouble with reading your data into R, try read.csv2().

Reading in Data

Let’s start with some simple examples.
First, we’re going to create some fake data and save it to a file so we don’t have to worry about file paths and working directories. We’ll create 100 observations of four variables. We’ll make column one a numeric variable and name the column “numerisch”. Column two will be integers and we’ll title it “ganze_zahlen”. The third will be categorical and we’ll call it “kategorisch”. The fourth will be binary and we’ll call it “binaer”.

data1 <- data.frame(
        numerisch = rnorm(100, mean = 4, sd = 2),
        ganze_zahlen = 1:100,
        kategorisch = rep(c("a", "b", "c", "d"), length.out = 100),
        binaer = rep(c(TRUE, FALSE), length.out = 100)
)
                    
write.csv(data1, file = "../data/data1.csv", row.names = F)
write.table(data1, file = "../data/data1.txt", row.names = F)
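To see what these two functions actually put on disk, it can help to look at the raw lines of the files. The following is a minimal sketch using a made-up two-row data frame and temporary files (the names small, csv_path, and txt_path are assumptions for this example, not part of the lesson's data).

```r
# A small made-up data frame, written to temporary files so no paths are needed.
small <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))

csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")

write.csv(small, file = csv_path, row.names = FALSE)
write.table(small, file = txt_path, row.names = FALSE)

# readLines() returns the raw text of each line, before any parsing.
csv_lines <- readLines(csv_path)
txt_lines <- readLines(txt_path)

csv_lines   # values separated by commas
txt_lines   # values separated by single spaces
```

Both files start with a line of column names, which matters later when the files are read back in.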

Now let’s read the CSV file back into R.

data1_csv <- read.csv("../data/data1.csv")

How can we tell if it worked?

First, if you are using RStudio, you will see that there is a new object in your Environment pane. You can click directly on the data in the Environment pane and the full data set will open in your Source pane. You can achieve the same result by using the function View() and passing the data set name (the variable name in R, not the file name) as the x argument (e.g., View(x = data1_csv)).

You’ll usually have a lot of data and viewing the data in this format is a bit of a sensory overload. The functions head() and str() are super useful.

head() prints the first 6 rows (observations) of the data frame by default. You can change the number of rows shown by specifying the n argument.

head(data1_csv)

##    numerisch ganze_zahlen kategorisch binaer
## 1  5.1882262            1           a   TRUE
## 2  5.2639747            2           b  FALSE
## 3  1.2985115            3           c   TRUE
## 4 -0.2650720            4           d  FALSE
## 5  5.9383152            5           a   TRUE
## 6  0.3691406            6           b  FALSE
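The n argument mentioned above can be tried on any data frame. Here is a brief sketch with a made-up 20-row data frame called demo (a name chosen only for this example); tail() is the counterpart that shows the last rows.

```r
# A made-up data frame with 20 rows.
demo <- data.frame(id = 1:20, squared = (1:20)^2)

head(demo)          # first 6 rows (the default)
head(demo, n = 3)   # first 3 rows
tail(demo, n = 2)   # last 2 rows
```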

The column names are above the columns and it looks like the data we were expecting. However, we are expecting four columns but we see five, and the first column does not have a column name. Why? The very first column shows you the index. These values are also called the row names. This information can be useful for subsetting your data.
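As a rough sketch of how the index can be used for subsetting, the example below picks rows by position from a small made-up data frame (pets and the other object names are assumptions for illustration only).

```r
# A small made-up data frame.
pets <- data.frame(name = c("Rex", "Lassie", "Petey", "Laika"),
                   age  = c(2, 5, 3, 3))

second_row <- pets[2, ]         # row 2, all columns
some_rows  <- pets[c(1, 3), ]   # rows 1 and 3, all columns
lassie_age <- pets[2, "age"]    # row 2, only the age column

rownames(pets)                  # the row names, stored as character strings
```

The part before the comma selects rows and the part after it selects columns; leaving either side empty means "all".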

str() prints the structure of the data. Here you can see the number and names of columns (aka variables), the number of observations, and the data type of each column.

str(data1_csv)

## 'data.frame':    100 obs. of  4 variables:
##  $ numerisch   : num  5.188 5.264 1.299 -0.265 5.938 ...
##  $ ganze_zahlen: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ kategorisch : chr  "a" "b" "c" "d" ...
##  $ binaer      : logi  TRUE FALSE TRUE FALSE TRUE FALSE ...

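In this data set, we have 100 observations (rows) of 4 variables, listed with their column names. Our categorical variable was read as character strings (chr). In versions of R before 4.0.0 it would have been read as a “Factor” by default. This distinction is important and we’ll come back to it.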
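The same information that str() prints can also be pulled out piece by piece, which is handy when you want to check it in code. A minimal sketch, using a made-up data frame named demo:

```r
# A made-up data frame with one column of each type discussed above.
demo <- data.frame(number = c(1.5, 2.5),
                   whole  = 1:2,
                   text   = c("a", "b"),
                   flag   = c(TRUE, FALSE),
                   stringsAsFactors = FALSE)

dim(demo)                         # number of rows, then number of columns
names(demo)                       # column names
col_types <- sapply(demo, class)  # data type of each column
col_types
```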

Now, let’s try the .txt file.

data1_txt <- read.table("../data/data1.txt")
head(data1_txt)

##                   V1           V2          V3     V4
## 1          numerisch ganze_zahlen kategorisch binaer
## 2    5.1882261662467            1           a   TRUE
## 3   5.26397465025898            2           b  FALSE
## 4   1.29851152336494            3           c   TRUE
## 5 -0.265071965429966            4           d  FALSE
## 6   5.93831519188064            5           a   TRUE

str(data1_txt)

## 'data.frame':    101 obs. of  4 variables:
##  $ V1: chr  "numerisch" "5.1882261662467" "5.26397465025898" "1.29851152336494" ...
##  $ V2: chr  "ganze_zahlen" "1" "2" "3" ...
##  $ V3: chr  "kategorisch" "a" "b" "c" ...
##  $ V4: chr  "binaer" "TRUE" "FALSE" "TRUE" ...

Why do we have 101 observations and why are all of our variables character strings?

The read.table() and read.csv() functions have an argument, header, that tells R how to interpret the first line in a data file. If your data contains a header, you want to make sure that header = T. Otherwise the column names are treated as the first row of data. A column can only contain data of a single type. It is easier to make a number into a string than a string into a number, so if R sees any strings in a column it interprets all of the values as strings (or, in older versions of R, as factors).

We didn’t get this problem in read.csv() because header = T is the default for this function. header = F is the default for read.table().
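The effect of header, and of the single-type rule for columns, can be reproduced on a very small scale. The sketch below uses a made-up two-column snippet (raw_text and the other names are assumptions for this example).

```r
# A made-up white-space separated snippet whose first line holds column names.
raw_text <- "height weight\n1.7 65\n1.8 80"

no_header   <- read.table(text = raw_text)                 # first line read as data
with_header <- read.table(text = raw_text, header = TRUE)  # first line read as names

str(no_header)     # one extra row, and every column is character
str(with_header)   # numbers are read as numbers

# The same single-type rule applies to plain vectors:
mixed <- c(1.7, "height")
class(mixed)       # the number has been turned into a string
```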

data1_txt <- read.table("../data/data1.txt", header = T)
head(data1_txt)

##    numerisch ganze_zahlen kategorisch binaer
## 1  5.1882262            1           a   TRUE
## 2  5.2639747            2           b  FALSE
## 3  1.2985115            3           c   TRUE
## 4 -0.2650720            4           d  FALSE
## 5  5.9383152            5           a   TRUE
## 6  0.3691406            6           b  FALSE

str(data1_txt)

## 'data.frame':    100 obs. of  4 variables:
##  $ numerisch   : num  5.188 5.264 1.299 -0.265 5.938 ...
##  $ ganze_zahlen: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ kategorisch : chr  "a" "b" "c" "d" ...
##  $ binaer      : logi  TRUE FALSE TRUE FALSE TRUE FALSE ...

Let’s try another tricky example.

data2 <- data.frame(good_dogs = c("Rex", "Lassie", "Petey", "Sergeant Stubby", "Laika"),
                    age = c(2, 5, 3, 9, 3))
                    
write.csv(data2, file = "../data/data2.csv", row.names = F)

data2_csv <- read.csv("../data/data2.csv", header = T)

What happens if you try to add an observation?

new_row <- c("Lady", 6)

rbind(data2_csv, new_row)

##         good_dogs age
## 1             Rex   2
## 2          Lassie   5
## 3           Petey   3
## 4 Sergeant Stubby   9
## 5           Laika   3
## 6            Lady   6

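In the output above the new row is added without complaint, because R 4.0.0 and later read strings as character vectors by default. In older versions of R, strings were read as “factors” by default, and the same call produced a warning and a missing value (NA) in place of “Lady”. Factors store categorical values as integers, which can be useful for efficient memory storage. The set of possible values is fixed when the factor is created, so you cannot easily add a new categorical value (like you can with numerical values).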
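To make the behaviour of factors more concrete, here is a minimal sketch with a made-up factor of dog names. It shows the integer codes behind a factor and what happens when a value outside the known set is assigned; the object names are assumptions for this example.

```r
# A made-up factor of dog names.
dogs <- factor(c("Rex", "Lassie", "Petey"))

levels(dogs)       # the fixed set of possible values
as.integer(dogs)   # the integer codes stored behind the scenes

# Assigning a value that is not among the levels gives NA (plus a warning,
# which is suppressed here so the example runs quietly).
with_new <- dogs
suppressWarnings(with_new[4] <- "Lady")
with_new

# One way around it: extend the levels first, then assign.
extended <- dogs
levels(extended) <- c(levels(extended), "Lady")
extended[4] <- "Lady"
extended
```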

To overcome this, you can set the argument stringsAsFactors to F (or FALSE). Before R 4.0.0, the default was stringsAsFactors = TRUE; since R 4.0.0, the default is stringsAsFactors = FALSE.

data2_csv <- read.csv("../data/data2.csv", header = T, stringsAsFactors = F)
new_row <- c("Lady", 6)

rbind(data2_csv, new_row)

##         good_dogs age
## 1             Rex   2
## 2          Lassie   5
## 3           Petey   3
## 4 Sergeant Stubby   9
## 5           Laika   3
## 6            Lady   6
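One detail that the printed output hides: c("Lady", 6) is a character vector, because c() can only hold one type, so the 6 is stored as "6". The sketch below, using a made-up two-row data frame, suggests checking the column types after rbind() and shows one possible alternative, building the new row as a one-row data frame. The object names are assumptions for this example.

```r
# A made-up data frame with a character and a numeric column.
dogs <- data.frame(good_dogs = c("Rex", "Lassie"),
                   age = c(2, 5),
                   stringsAsFactors = FALSE)

# New row as a plain vector: both elements become character strings.
vector_row <- c("Lady", 6)
class(vector_row)

via_vector <- rbind(dogs, vector_row)
class(via_vector$age)   # age is no longer numeric

# New row as a one-row data frame: each column keeps its own type.
df_row <- data.frame(good_dogs = "Lady", age = 6, stringsAsFactors = FALSE)

via_df <- rbind(dogs, df_row)
class(via_df$age)       # age stays numeric
```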

Be aware, however, that when we start creating some statistical models, we may need categorical variables to be represented as factors. It’s a good idea to make the argument explicit and set it to TRUE or FALSE depending on how you are using the data.
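A short sketch of what setting the argument explicitly looks like, using a made-up two-row CSV snippet read through the text = argument (dog_text and the result names are assumptions for this example):

```r
# A made-up CSV snippet.
dog_text <- "good_dogs,age\nRex,2\nLassie,5"

as_strings <- read.csv(text = dog_text, stringsAsFactors = FALSE)
as_factors <- read.csv(text = dog_text, stringsAsFactors = TRUE)

class(as_strings$good_dogs)   # character
class(as_factors$good_dogs)   # factor
class(as_factors$age)         # numbers are not affected by the argument
```

Writing the argument out also means the code behaves the same way regardless of which version of R runs it.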

You can also change the data type later, but it’s easier if you know how you want it represented beforehand.

To change a factor to a string of characters, you can use the function as.character().

data2_csv$good_dogs <- as.character(data2_csv$good_dogs)
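The conversion also works in the other direction with as.factor(). The sketch below uses made-up vectors and additionally shows a common stumbling block when a factor holds number-like strings: converting it directly with as.numeric() returns the internal integer codes, so going through as.character() first is one way to get the original numbers back.

```r
# Factor to character and back again.
letters_factor <- factor(c("b", "a", "b"))
letters_chr    <- as.character(letters_factor)
back_to_factor <- as.factor(letters_chr)

class(letters_chr)
class(back_to_factor)

# A factor whose labels look like numbers.
codes <- factor(c("10", "20", "10"))

as.numeric(codes)                 # internal integer codes, not the labels
as.numeric(as.character(codes))   # the numbers written in the labels
```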

Changing data types in a data frame is also pretty straightforward with the package dplyr, which we’ll cover in another lesson.

================================================================================

Session information:

Last update on 2020-11-04

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_AT.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_AT.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_AT.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.8.0  magrittr_2.0.3 
##  [5] evaluate_0.16   stringi_1.7.8   cachem_1.0.6    rlang_1.0.5    
##  [9] cli_3.3.0       rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.0    
## [13] rmarkdown_2.16  tools_4.2.1     stringr_1.4.1   xfun_0.32      
## [17] yaml_2.3.5      fastmap_1.1.0   compiler_4.2.1  htmltools_0.5.3
## [21] knitr_1.40      sass_0.4.2
