If you’ve worked in R, you’ve probably run into plenty of error messages that are super confusing. Sometimes those errors occur because your data are stored as the wrong object type.
Let’s look at two ways to store a range of numbers. In R, you can use the function c() to concatenate values. You can enter the numbers 1 through 10 in two ways. The colon operator : creates a range that runs from the first number to the last. You can also enter each value by hand.
x <- c(1:10)
y <- c(1,2,3,4,5,6,7,8,9,10)
x
## [1] 1 2 3 4 5 6 7 8 9 10
y
## [1] 1 2 3 4 5 6 7 8 9 10
On the surface, the objects x and y look the same. You can check whether they are both numeric objects by using the function is.numeric():
is.numeric(x)
## [1] TRUE
is.numeric(y)
## [1] TRUE
Both are numeric, and simple arithmetic on them gives identical results. However, when you check whether x and y themselves are exactly identical, you find that they are not!
identical(x^2, y^2)
## [1] TRUE
identical(x + 0.5, y + 0.5)
## [1] TRUE
identical(x, y)
## [1] FALSE
So what’s the deal? Well, the output of : is going to be an integer. Manually entering the numbers, however, gives you a floating point number (called a “double”):
typeof(x)
## [1] "integer"
typeof(y)
## [1] "double"
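If you need the two vectors to match exactly, you can make the types agree yourself. The sketch below shows one way to do that, using the same x and y as above; the names y_int and y_conv are made up for this illustration.

```r
x <- 1:10
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# The L suffix creates integer values directly
y_int <- c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L)
typeof(y_int)
## [1] "integer"
identical(x, y_int)
## [1] TRUE

# as.integer() converts an existing double vector
y_conv <- as.integer(y)
identical(x, y_conv)
## [1] TRUE

# all.equal() compares the numeric values and ignores integer vs. double
isTRUE(all.equal(x, y))
## [1] TRUE
```

Which option fits depends on whether you care about the storage type or only about the values.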
For the most part, R is actually pretty good at dealing with data stored in the wrong format. However, it’s still not as good as a human, and it will make mistakes.
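As a rough illustration of the kind of mistake that can happen, here is a made-up example (the vector values and its contents are hypothetical) where numbers have ended up stored as text:

```r
# Hypothetical example: numbers that were read in as text
values <- c("10", "9", "100")
typeof(values)
## [1] "character"

# Character data are sorted as text, not by numeric value
sort(values)
## [1] "10"  "100" "9"

# sum(values) would stop with an error because the type is character.
# Converting to numeric first gives the expected behavior.
values_num <- as.numeric(values)
sort(values_num)
## [1]   9  10 100
sum(values_num)
## [1] 119
```

R does not complain about the first sort() call, so checking typeof() early can save some confusion later.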
So what are the different data types?
The one-dimensional structures, called vectors, are the basic building blocks which can be used to build derived objects like data frames and matrices.
The term “vector” can be a bit confusing, especially since R is used so much in statistics. It does not have anything to do with the math term “vector”. Rather, in this context it essentially means a sequence of values. Contrast this with a NULL object, which has a length of 0.
x <- 1
length(x)
## [1] 1
is.vector(x) ## TMI: is.vector() technically checks if the object is a vector with no attributes other than names. To truly check if an object is a vector use: is.atomic(x) || is.list(x) . For our purposes now, is.vector() will work.
## [1] TRUE
x <- c(1, 2, 3, 4)
length(x)
## [1] 4
is.vector(x)
## [1] TRUE
x <- 1:1000
length(x)
## [1] 1000
is.vector(x)
## [1] TRUE
x <- 0
length(x)
## [1] 1
is.vector(x)
## [1] TRUE
x <- NULL
length(x)
## [1] 0
is.vector(x)
## [1] FALSE
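To see why the note about is.vector() matters, here is a small sketch. The attribute name "units" is arbitrary and only used for illustration.

```r
x <- c(1.5, 2.5, 3.5)
is.vector(x)
## [1] TRUE

# Add an attribute other than names
attr(x, "units") <- "cm"

# is.vector() now reports FALSE, although x is still a sequence of doubles
is.vector(x)
## [1] FALSE

# The more thorough check still recognizes it
is.atomic(x) || is.list(x)
## [1] TRUE
```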
There are two types of vectors: lists and atomic vectors. The main difference between these two object types is that atomic vectors are sequences of data which are all the same type. Lists can contain multiple types of data.
There are four common types of atomic vectors: logical, integer, double, and character. (R also has complex and raw vectors, but they are rarely needed.) Logical values are either TRUE, FALSE, or NA. These are most often used in comparisons. Integer and double are both numeric, with the former containing whole numbers and the latter containing real numbers. Character vectors contain text.
logical_vector <- c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
is.logical(logical_vector)
## [1] TRUE
character_vector <- c("true", "false", "true", "false", "true")
is.logical(character_vector)
## [1] FALSE
is.character(character_vector)
## [1] TRUE
is.logical(as.logical(character_vector))
## [1] TRUE
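A quick way to see the four types side by side is to call typeof() on a single value of each. The last lines are a small caution about as.logical(), assuming the input strings shown here.

```r
typeof(TRUE)
## [1] "logical"
typeof(5L)
## [1] "integer"
typeof(5)
## [1] "double"
typeof("5")
## [1] "character"

# Both integer and double count as numeric
is.numeric(5L) && is.numeric(5)
## [1] TRUE

# as.logical() only understands a few spellings; anything else becomes NA
as.logical(c("true", "TRUE", "T", "yes"))
## [1] TRUE TRUE TRUE   NA
```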
A list is a sequence of heterogeneous data.
# Use list(), not c(): c() would coerce every value to character
x <- list("one", 1.2, 1, TRUE, c(1,2,3,4,5), c("hello", "world"))
x[[1]]
## [1] "one"
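The difference between an atomic vector and a list shows up as soon as you mix types. The following sketch uses made-up objects (mixed_vector and mixed_list) to compare c() with list().

```r
# c() builds an atomic vector, so mixed values are coerced to one type
mixed_vector <- c("one", 1.2, TRUE)
typeof(mixed_vector)
## [1] "character"
mixed_vector
## [1] "one"  "1.2"  "TRUE"

# list() keeps each element as it is
mixed_list <- list("one", 1.2, TRUE, c(1, 2, 3))
sapply(mixed_list, typeof)
## [1] "character" "double"    "logical"   "double"

# Single brackets return a list; double brackets return the element itself
is.list(mixed_list[1])
## [1] TRUE
mixed_list[[4]]
## [1] 1 2 3
```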
If you think about an Excel spreadsheet, a vector would be one column of values. The number of rows is variable, but you only have one column. Data frames and matrices, however, are more similar to the whole spreadsheet in that they have multiple columns as well as rows. Data frames are like lists in that they can hold multiple data types (though each column can only be of one type). Matrices must have homogeneous data.
df <- data.frame(
  categorical = sample(c("a", "b", "c"), size = 300, replace = TRUE),
  double = rnorm(300, mean = 200, sd = 30),
  # floor() returns a double, so convert explicitly to get an integer column
  integer = as.integer(floor(rnorm(300, mean = 120, sd = 14))),
  logical = sample(c(TRUE, FALSE), size = 300, replace = TRUE)
)
head(df)
##   categorical   double integer logical
## 1           a 257.4975     139   FALSE
## 2           a 236.6566     151    TRUE
## 3           c 216.6820      99   FALSE
## 4           b 229.8264     119   FALSE
## 5           c 164.2863      92    TRUE
## 6           b 188.1514     135    TRUE
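You can confirm that each column keeps its own type, and that a matrix does not, with a small made-up data frame. The names small_df and small_matrix and the values in them are only for illustration.

```r
small_df <- data.frame(
  name = c("a", "b", "c"),
  height = c(1.5, 2.25, 3),
  count = c(1L, 2L, 3L),
  flag = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # already the default in R 4.0.0 and later
)

# A data frame is a list of equal-length columns
is.list(small_df)
## [1] TRUE
typeof(small_df$height)
## [1] "double"
typeof(small_df$count)
## [1] "integer"

# A matrix holds one type, so converting coerces every column to character
small_matrix <- as.matrix(small_df)
typeof(small_matrix)
## [1] "character"
dim(small_matrix)
## [1] 3 4
```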
================================================================================
Last update on 2020-10-14
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_AT.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_AT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_AT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_AT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.8.0 magrittr_2.0.3
## [5] evaluate_0.16 stringi_1.7.8 cachem_1.0.6 rlang_1.0.5
## [9] cli_3.3.0 rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.0
## [13] rmarkdown_2.16 tools_4.2.1 stringr_1.4.1 xfun_0.32
## [17] yaml_2.3.5 fastmap_1.1.0 compiler_4.2.1 htmltools_0.5.3
## [21] knitr_1.40 sass_0.4.2
================================================================================
Copyright © 2022 Dan C. Mann. All rights reserved.