There are different data structures in R. In this document, I briefly explain individual types. If you need more details, you can find them in R’s documentation.
R’s base data structures can be organized by their dimensionality (1 dimension, 2 dimensions, or N dimensions) and by whether their contents are all of the same type (homogeneous) or of different types (heterogeneous). This gives rise to the five data structures most often used in data analysis:
Dimension | Homogeneous | Heterogeneous |
---|---|---|
1d | Vector | List |
2d | Matrix | Data frame |
Nd | Array | |
Scalars and types of variables
Types of variables
Note that R has no separate scalar type: a single value is simply a vector of length one. Such a value holds one atomic value at a time. It does not have to be numeric (stored as double by default); it can also be of another type, such as character (i.e. a string), integer, or logical. We can check the type of a variable by using the typeof() function:
typeof(1)
## [1] "double"
typeof("politics")
## [1] "character"
typeof(TRUE)
## [1] "logical"
Note that having quotation marks around a number will give you a character variable, instead of a numeric variable. For example,
typeof("1")
## [1] "character"
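As a small supplementary sketch (the variable names are made up for illustration), the following shows that a number typed without a suffix is stored as a double, that the L suffix requests an integer, and that a single value is itself a vector of length one:

```r
n_double  <- 1    # plain numeric literal
n_integer <- 1L   # the L suffix asks for an integer

typeof(n_double)     # "double"
typeof(n_integer)    # "integer"
length(n_double)     # 1: a single value is a vector of length one
is.vector(n_double)  # TRUE
```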
Factors
Aside from types, objects can also carry attributes. Attributes can be thought of as a named list (with unique names), and can be accessed individually with attr() or all at once (as a list) with attributes().
One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class, “factor”, which makes them behave differently from regular integer vectors, and the levels, which define the set of allowed values.
x <- factor(c("a", "b", "b", "a"))
x
## [1] a b b a
## Levels: a b
class(x)
## [1] "factor"
levels(x)
## [1] "a" "b"
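To make the "integer vector plus two attributes" description concrete, here is a short sketch that inspects the same factor with typeof(), attributes(), and as.integer(). The names attrs and codes are only illustrative.

```r
x <- factor(c("a", "b", "b", "a"))

typeof(x)                 # "integer": the underlying storage
attrs <- attributes(x)    # a list holding the levels and the class
names(attrs)
attr(x, "levels")         # the same values that levels(x) returns

codes <- as.integer(x)    # the integer codes behind the labels
codes                     # 1 2 2 1
```

Each code is the position of the corresponding value in levels(x).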
Coercion
We can change the type of a variable to type x using the function as.x. This process is called “coercion”. For example, the following code changes the number 65 to the string “65”:
as.character(65)
## [1] "65"
typeof(65)
## [1] "double"
typeof(as.character(65))
## [1] "character"
Similarly, you can coerce one type to another by using as.character(), as.double(), as.integer(), or as.logical().
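The sketch below tries a few of these coercion functions on made-up inputs. It also shows what happens when a value cannot be converted: R returns NA and issues a warning, which is suppressed here only to keep the example quiet.

```r
int_from_chr <- as.integer("42")    # 42L
dbl_from_chr <- as.double("3.14")   # 3.14
lgl_from_chr <- as.logical("TRUE")  # TRUE
int_from_lgl <- as.integer(TRUE)    # 1L
int_from_dbl <- as.integer(3.9)     # 3L: the fractional part is dropped

# "abc" is not a number, so the result is NA (with a warning)
failed <- suppressWarnings(as.integer("abc"))
is.na(failed)                       # TRUE
```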
Vectors
The basic data structure in R is the vector, a 1-dimensional array whose entries are the same type.
Creation
The following code produces a vector containing the numbers 1, 3, 5, 7, and 9:
vec <- c(1,3,5,7,9)
vec
## [1] 1 3 5 7 9
We don’t have to type out all the numbers. The following code assigns a vector of the numbers from 1 to 100 to vec:
vec <- 1:100
vec
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
What if I only want even numbers from 1 to 100 (inclusive)? We can manipulate vectors using arithmetic operations (just like numbers). Note that arithmetic operations happen element-wise.
even <- 1:50 * 2
even
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72
## [37] 74 76 78 80 82 84 86 88 90 92 94 96 98 100
Or we can use the seq() function:
even <- seq(2,100,2) # seq(start number, end number, by)
even
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72
## [37] 74 76 78 80 82 84 86 88 90 92 94 96 98 100
We can also use the c() function to combine (“concatenate”) several small vectors into one large vector.
z <- 1:5
z <- c(z,3,z)
z
## [1] 1 2 3 4 5 3 1 2 3 4 5
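Because all entries of a vector must share one type, c() silently coerces mixed inputs to the most general type among them. The following sketch, with made-up values, illustrates this:

```r
mixed_num <- c(1L, 2.5)        # integer and double    -> double
mixed_lgl <- c(TRUE, 2L)       # logical and integer   -> integer
mixed_chr <- c(1, "a", TRUE)   # anything and character -> character

typeof(mixed_num)   # "double"
typeof(mixed_lgl)   # "integer"
typeof(mixed_chr)   # "character"
mixed_chr           # "1" "a" "TRUE"
```

If you need to keep the original types side by side, a list is the appropriate structure.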
Checking
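We can check whether a variable is a vector using is.vector() or is.atomic(). Note that is.vector() also returns TRUE for lists, and returns FALSE when the object carries attributes other than names, while is.atomic() tests specifically for an atomic (homogeneous) vector. The type of the elements can be checked using is.character(), is.double(), is.integer(), and is.logical().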
is.vector(vec)
## [1] TRUE
is.atomic(vec)
## [1] TRUE
Use the length() function to figure out how many elements there are in a vector.
odd <- seq(1,99,2)
length(odd)
## [1] 50
Extracting elements
We can extract elements of a vector by position using square brackets, and we can get several elements at once. The following code extracts the 5th through 9th even numbers (inclusive), and assigns them to the variable y:
y <- even[5:9]
y
## [1] 10 12 14 16 18
This extracts just the 3rd and 5th even numbers:
even[c(3,5)]
## [1] 6 10
We can also exclude certain elements using negative indexing. This returns a new vector and leaves the original unchanged. Let’s say I want all even numbers except the first two:
even[-c(1,2)]
## [1] 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76
## [37] 78 80 82 84 86 88 90 92 94 96 98 100
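As a supplementary sketch, the code below pulls out a single element, confirms that negative indexing leaves the original vector untouched, and selects elements with a logical condition. Logical indexing is not covered in the text above; it is included here as one more common way to extract elements.

```r
even <- seq(2, 100, 2)

first <- even[1]          # a single element: 2
rest  <- even[-c(1, 2)]   # everything except the first two

length(even)              # 50: the original vector is unchanged
length(rest)              # 48

# Keep only the elements for which the condition is TRUE
big <- even[even > 90]
big                       # 92 94 96 98 100
```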
Matrices and Arrays
Matrices are just the 2-dimensional analogs of vectors while arrays are the n-dimensional analogs of vectors. As with vectors, elements of matrices and arrays have to be of the same type. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of.
Creation
Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim():
a <- matrix(1:6, ncol = 3, nrow = 2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
b <- array(1:12, c(2, 3, 2))
b
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
m <- 1:6   # named m rather than c, to avoid masking the c() function
dim(m) <- c(2, 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
Notice that R takes the elements in the vector you give it and fills in the matrix column by column. If we want the elements to be filled in by row instead, we have to put in a byrow = TRUE argument:
A <- matrix(1:6, nrow = 2, byrow=TRUE)
A
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
Checking
To get the dimensions of a matrix, we can use the dim(), nrow(), and ncol() functions; length() returns the total number of elements.
length(a)
## [1] 6
nrow(a)
## [1] 2
ncol(a)
## [1] 3
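The dim() function returns all dimensions at once. The sketch below recreates the matrix a and the array b from the creation examples and queries their dimensions:

```r
a <- matrix(1:6, ncol = 3, nrow = 2)
b <- array(1:12, c(2, 3, 2))

dim(a)      # 2 3: rows first, then columns
dim(b)      # 2 3 2
length(b)   # 12: the product of the dimensions
```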
You can also set the row and column names.
rownames(a)<-c("A","B")
colnames(a)<-c("a","b","c")
a
##   a b c
## A 1 3 5
## B 2 4 6
Extracting elements
To access the element in the ith row and jth column of the matrix A, use the index A[i, j]. Leaving one index blank selects the entire row or column:
A[1,2]
## [1] 2
A[1,]
## [1] 1 2 3
A[,1]
## [1] 1 4
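Note that extracting a single row or column returns a plain vector by default. The sketch below shows this and, as an addition not covered above, the drop = FALSE argument, which keeps the result as a matrix. The variable names are illustrative.

```r
A <- matrix(1:6, nrow = 2, byrow = TRUE)

row1_vec <- A[1, ]                 # a plain vector: the dimensions are dropped
row1_mat <- A[1, , drop = FALSE]   # still a matrix, with 1 row and 3 columns

dim(row1_vec)   # NULL
dim(row1_mat)   # 1 3

# Several columns at once: columns 1 and 3 of every row
sub <- A[, c(1, 3)]
sub
```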
Lists
In all the data structures so far, the elements have to be of the same type.
Creation
To hold elements of different types in one data structure, we can use a list, which we create with list(). We can think of a named list as a collection of key-value pairs, where the keys are strings.
event <- list(year = "2021", month = "Aug")
event
## $year
## [1] "2021"
## 
## $month
## [1] "Aug"
The str() function can be used to inspect what is inside event:
str(event)
## List of 2
##  $ year : chr "2021"
##  $ month: chr "Aug"
To access the year element of event, we have two options:
event[["year"]]
## [1] "2021"
# or
event$year
## [1] "2021"
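A related point, sketched here as a supplement: single square brackets also work on a list, but they return a smaller list rather than the element itself. The names as_list and as_value are illustrative.

```r
event <- list(year = "2021", month = "Aug")

as_list  <- event["year"]     # a list of length 1 that contains the element
as_value <- event[["year"]]   # the element itself, a character vector

class(as_list)    # "list"
class(as_value)   # "character"
```

Use [[ ]] or $ when you want the stored value, and [ ] when you want a sub-list.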
Checking
The elements of a list can be anything, even another data structure! Let’s add the Saturdays in August:
event$saturday <- c(7,14,21,28)
str(event)
## List of 3
##  $ year    : chr "2021"
##  $ month   : chr "Aug"
##  $ saturday: num [1:4] 7 14 21 28
To see the keys associated with a list, use the names() function:
names(event)
## [1] "year" "month" "saturday"
Data frames
A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.
You can subset a data frame like a 1d structure (where it behaves like a list), or a 2d structure (where it behaves like a matrix). We will talk about subsetting later when we cover how to manipulate data in R.
Creation
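# Since R 4.0.0 the default is stringsAsFactors = FALSE, which would keep y as character;
# it is set to TRUE here so that the output below shows y as a factor.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = TRUE)
str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3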
Checking
To check if an object is a data frame, use class() or test explicitly with is.data.frame():
class(df)
## [1] "data.frame"
is.data.frame(df)
## [1] TRUE
You can check the names of rows and columns.
colnames(df)
## [1] "x" "y"
rownames(df)
## [1] "1" "2" "3"
You can also check the numbers of rows and columns.
ncol(df)
## [1] 2
nrow(df)
## [1] 3
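The dual nature of a data frame, part list and part matrix, can be checked directly. The following is a minimal sketch using the same df; the names by_list, by_matrix, and cell are illustrative.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

is.list(df)                          # TRUE: a data frame is a list underneath
length(df) == ncol(df)               # TRUE: one list element per column
identical(names(df), colnames(df))   # TRUE

by_list   <- df[["x"]]    # list-style access to a column
by_matrix <- df[, "x"]    # matrix-style access to the same column
identical(by_list, by_matrix)        # TRUE

cell <- df[2, "y"]        # the value in row 2 of column y
```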
Coercion
You can coerce an object to a data frame with as.data.frame():
- A vector will create a one-column data frame.
- A list will create one column for each element; it’s an error if they’re not all the same length.
- A matrix will create a data frame with the same number of columns and rows as the matrix.
vec<-c(1:5)
vec<-as.data.frame(vec)
str(vec)
## 'data.frame':    5 obs. of  1 variable:
##  $ vec: int  1 2 3 4 5
list<-list(1:2, 1:3, 1:4)
list<-as.data.frame(list)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 2, 3, 4
matrix<-a
matrix<-as.data.frame(matrix)
str(matrix)
## 'data.frame':    2 obs. of  3 variables:
##  $ a: int  1 2
##  $ b: int  3 4
##  $ c: int  5 6
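For contrast with the failing list example above, here is a sketch of a list whose elements all have the same length, which coerces without trouble. The names ok_list, ok_df, and result are made up for illustration, and tryCatch() is used only to capture the error from the unequal-length case instead of stopping the script.

```r
# All elements have length 3, so each becomes one column
ok_list <- list(id = 1:3, label = c("a", "b", "c"))
ok_df <- as.data.frame(ok_list)

dim(ok_df)     # 3 2
names(ok_df)   # "id" "label"

# Elements of different lengths trigger an error, captured here
result <- tryCatch(
  as.data.frame(list(1:2, 1:3)),
  error = function(e) "error"
)
result         # "error"
```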