Manipulate Data in R

In this document, you will learn how to prepare data using R. We start from setting a working directory, then learn how to install packages. We also look at how to import, view, manipulate, and Export data.

Before Loading Data

Set the Working Directory

Before loading data and start playing with it, you need to tell R where to find the data in your computer, i.e., the directory or the path. So before we go in further, let’s first see how can we set and change the working directory.

To set a directory, you use the setwd() function.

#view your current working directory
getwd()

#change the working directory to your project folder
setwd("/Users/username/RWork")

#view what is inside this working directory
dir()

**A note about directory/filenames in Windows:* In the Windows system, path names use backslashes, e.g. C:\Users\username\RWork\. R, on the other hand, inherits its file and folder naming conventions from Unix which uses forward slashes instead: C:/Users/username/RWork/.

Install packages

A package bundles together code, data, documentation, and tests, and is easy to share with others. In R’s CRAN (Comprehensive R Archive Network), there are huge variety of packages available for you to use. This is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.

To use those packages, you first need to install them. You only need to do this once. To install a package pkgName, simply type install.packages("pkgName") into the console.

install.packages("dplyr")

To use functions in the package, we first have to load the package: library(pkgName). If the package has not been installed yet, we will get an error. Otherwise, functions in the package are now available for use.

library(dplyr)

Import Data

Packages needed for different data formats

You can import all kinds of data format in R using various packages.

File Type	Package	Function
Excel (.xlsx)	openslxs	read.xlsx
Stata (.dta, versions 12 and earlier)	foreign	read.dta
Stata (.dta, versions 13 and 14)	readstata13	read.dta13
SPSS (.sav)	foreign	read.spss
Comma separated (.csv)	none	read.csv
Tab delimited (.tab or .txt)	none	read.table
R data files (.Rdata)	none	load

#Install necessary packages, you only need to do this once
install.packages("openxlsx")
install.packages("foreign")
install.packages("readstata13")

Load packages and import data

Although you only need to install packages to R once, you need to load the required packages in R using the library() function before you analysis every time in the beginning of your R session.

1.Read Excel file

#You need to library() a package before using it in R. 
library(openxlsx) 

#If your data is in the current working directory, you can simply do:
data <- read.xlsx("filename.xlsx")

# If your data is not in the current working directory, you need to give R the path to the data, such as 
data <- read.xlsx("/Users/username/RWork/filename.xlsx") # do not run

2.Read a comma-seperated (.csv) file

# If your data is in the current working directory, you can simply do:
ANES <- read.csv("cleaned_ANES.csv")
# If your data is not in the current working directory, you need to give the path to the data, such as
ANES <- read.csv("/Users/username/RWork/cleaned_ANES.csv")

3.Read a R data (.Rdata) file

load("filename.Rdata")

4.Read a Stata (.dta, versions 12 and earlier) file

library(foreign)
data<-read.dta("filename.dta")

5.Read a Stata (.dta, versions 13 and 14) file

The “foreign” pack does not support stata data after version 12. We can use “readstat13” to import from the later versions of stata instead.

library(readstata13)
# Read from stata version 13&14
data<-read.dta13("filename.dta")

6.Read .sav (SPSS) file

library(foreign)
data <- read.spss("filename.sav", to.data.frame = TRUE)

7. Read a Tab deliminted (.tab or .txt) file

data <- read.table("JoP_R_data.txt",sep="\t")

Some useful arguments within read function

In some cases, you need to specify additional information of the data in the data importing function to be able to load the data in R in addition to the path and file name. You can get information about the arguments and their default in the R documents using the help() function.

#Practice and review what arguments are there inside different data import functions
?read.csv
?read.xlsx
?read.spss

Header

The header argument tells R whether to treat the first row of the data as the variable names. In the examples above, you probably have noticed that the default header argument in read.table is FALSE, while the default header argument in read.csv is TRUE. It means that by default, read.table() function will not treat the top row of the data as the variable name. If the first row of your data is the variable name, you’d want to change the default from FALSE to TRUE:

# If the first row of your data contains the variable names, you need to specify the header argument to TRUE in read.table. You do not need to specify the header to TRUE in read.csv, as it is the default in read.csv.
read.table(file, header = TRUE)
# If the first row of your csv data is not the variable names, you need to change the default of header in read.csv to FALSE
read.csv(file, header = FALSE)

sep

The sep argument tell R how to separate the values in the raw data. The default of read.table is sep=“”, which tells R to separate values on one or more white spaces. For example, under this default, “Penn State” will be treated as two values, and will be put into two columns in a data frame.

If your data is not white space separated, you need to change this default accordingly. In the “Penn State” example, the “Penn State” might be one variable, and should be put into one column instead of two. In this case, the data is not separated by white space, but tabs. If your data is tab separated, you need to change the sep argument to \(“\t”\):

read.table(file, sep="\t")

If your data is comma separated, the sep argument should be “,” which is the default in read.csv().

to.data.frame in read.spss

When you read a SPSS data into R, the read.spss() function does not change the format of the data as a data frame. However, as most of our analyses are on a data frame, you’d want to change this default when you read a SPSS file:

read.spss(file, to.data.frame = TRUE)

Get existing data

There are multiple ways you can get data into R. First, there are data included in packages. Second, you can find datasets online and introduce a resource with a lot of links to political datasets. Third, you can load data from GitHub repositories.

Using data from data packages

A lot of packages include dataset. For example, the package poliscidata provide General Social Survey, American National Election Survey, and States dataset.

install.packages("poliscidata")
library(poliscidata)

Using specific packages to load data

Some data are directly downloadable to R using specific packages. For example, to get data from the General Social Survey, you can use the gssr package (Kieran healy, 2019). This package is not on CRAN, so we need to use the devtools package to install it and then use library() to load it. The easiest way to get the GSS data is to simply ask to get all of the data in an object called gss_all with data(). This loads 6108 variables contained in the cumulative GSS data.

devtools::install_github("kjhealy/gssr")
library("gssr")
data(gss_all)

There are several other functions that might be help conditional upon what you would like to study. Accordingly, we recommend that you consult https://kjhealy.github.io/gssr/ for various examples on how to access the data and the documentation of different variables.

Similarly, you can obtain the American National Election Study data, using the package anesr(Martherus, 2019). This package is also not available on CRAN for now, so we need to use install_github() function from the devtools pakcage. After that, we can lod the package.

devtools::install_github("jamesmartherus/anesr")
library("anesr")

To see all the datasets available via the package, simply type the following:

data(package="anesr")

If you run the above code, you will see that there is a dataset called timeseries_2020. To access to this data, use the data() function. This returns an object called timeseries_2020 with 8280 observations and 1381 variables. For further information, see https://github.com/jamesmartherus/anesr.

data(timeseries_2020)

As you may see from the data set, you there are too many variables that may not be of your interests. Also, the names of variables are not informative at all – they are marked as a combination of “V” and some numbers. So before using any publicly available data, I encourage you to check the documentations for data sets and use only variable they you need.

Load data from GitHub repositories

A lot of data today is available in GitHub repositories. To get data from GitHub, we can first set the file path (click on ‘Raw’ to get to the raw dataset) and the use read.csv() function.

urlfile<-'https://raw.githubusercontent.com/erikgahner/polls/master/polls.csv'
polls<-read.csv(url(urlfile))

View data

In addition to the View() function, there are many other useful functions for getting information on your data.

View some basic features

# list all the objects in your current working environment
ls()

# Load dataset from GitHub
urlfile<-'https://raw.githubusercontent.com/boyoon0715/R_Workshops/main/cleaned_ANES.csv'
ANES<-read.csv(url(urlfile))
# list the variable names
names(ANES)

##  [1] "caseid"              "weight_full"         "mode"                "voted2012"           "regist2016"          "primary2016"        
##  [7] "intend.vote"         "ideology"            "pid_strength"        "born_again"          "fundamentalist"      "race"               
## [13] "female"              "income"              "hannity"             "minutes60"           "cooper"              "matthews"           
## [19] "press"               "conan"               "big.bang"            "modern.family"       "rush"                "NPR"                
## [25] "voted2016"           "vote.choice2016"     "Dem.Pres.cand.FT"    "GOP.Pres.cand.FT"    "Lib.Pres.cand.FT"    "Green.Pres.cand.FT" 
## [31] "Dem.House.cand.FT"   "GOP.House.cand.FT"   "Ind.House.cand.FT"   "Dem.Sen.cand.FT"     "GOP.Sen.cand.FT"     "Ind.Sen.cand.FT"    
## [37] "Sr.Sen.FT"           "Jr.Sen.FT"           "Nonrun.Sen.FT"       "Dem.VP.cand.FT"      "GOP.VP.cand.FT"      "Roberts.FT"         
## [43] "Pope.FT"             "Christian.Fund.FT"   "Fem.FT"              "Liberals.FT"         "Labor.Unions.FT"     "Poor.FT"            
## [49] "Big.Business.FT"     "Conservatives.FT"    "SCt.FT"              "Gays.FT"             "Congress.FT"         "Rich.FT"            
## [55] "Muslims.FT"          "Christians.FT"       "Jews.FT"             "Tea.Party.FT"        "Police.FT"           "Trans.FT"           
## [61] "Scientists.FT"       "BLM.FT"              "tax.mill"            "vaccinces.in.school" "vaccines"            "Govt.Reduce.Eq"     
## [67] "Req.Equal"           "how.equal"           "Foreign.Im"          "immig.support"       "Free.Trade.Ag"       "marijuana"          
## [73] "Asians.FT"           "Hispanics.FT"        "Blacks.FT"           "Illegals.FT"         "Whites.FT"           "location"           
## [79] "educ"                "GOP"                 "Dem"                 "employed"            "cath_attend"         "prot_attend"        
## [85] "jew_attend"          "cath_nonatt"         "prot_nonatt"         "jew_nonatt"          "Obama12"             "Romney12"           
## [91] "yes.vote2016"        "Trump2016"           "Clinton2016"

# dimension of your data
dim(ANES)

## [1] 4271   93

# view the first 2 rows in the data
head(ANES,n=2)
# view the last 2 rows in the data
tail(ANES,n=2)

# class of an object
class(ANES)

## [1] "data.frame"

class(ANES$NPR)

## [1] "integer"

# view summary statistics of a variable
summary(ANES$NPR)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1452  0.0000  1.0000    1999

Subset data

# hand pick rows and columns of a data
ANES[c(1,2),c(2,3)] # first two rows and second and third column
# exclude certain rows and columns
ANES[-1, -c(1,2,3)] # exclude the first row and first three columns 
# select more than one variables
ANES[c("NPR","rush","hannity")]
# select based on the values of observations
ANES[which(ANES$NPR==1),]
# select based on multiple critieria
ANES[which(ANES$Dem.Pres.cand.FT >=20 & ANES$GOP.Pres.cand.FT >= 20),] # select the respondents whose feeling thermometer scores towards Clinton AND Trump are BOTH bigger or equal to 20 
ANES[which(ANES$Dem.Pres.cand.FT >=20 | ANES$GOP.Pres.cand.FT >= 20),] # select the respondents whose feeling thermometer score towards Clinton OR Trump is bigger or equal to 20 
# subset use the subset function
subset(ANES, ANES$Dem.Pres.cand.FT >=20 | ANES$GOP.Pres.cand.FT >= 20)

Sort data

You can sort your data on one or more variables, and you can set the order to be ascending or descending

# the order function returns a permutation which rearranges its first argument into ascending or descending order. The default is ascending. 
order(ANES$Dem.Pres.cand.FT)
# change into descending order using the decreasing argument inside order()
order(ANES$Dem.Pres.cand.FT, decreasing = TRUE)
# Sort the turnout data based on the ascending order of ANES 
ordered.ANES <- ANES[order(ANES$Dem.Pres.cand.FT,decreasing = TRUE),]
# Note after ordering, the NAs will always be at the end no matter whether the order is ascending or descending

Manipulate data

The data sets you have been using here (ANES) have already been cleaned. They are in the correct formats and contains everything you need for the class analysis. However, in reality when you use other people’s data or download data online, you might need to use several variables in the data to calculate the variable you are interested in, and you might need to correct the class of a variable, and you might need to recode the values in the data.

Create new variables

Some times, you need to use multiple variables in a date set to create a variable of your interest. Instead of calculating the value every time when you need to use the variable, it is more convenient to create such variable first and add the new variable to the data set.

# Add a new variable Diff.FT to the ANES data frame
ANES$Diff.FT <- ANES$GOP.Pres.cand.FT - ANES$Dem.Pres.cand.FT
head(ANES$Diff.FT,5)

## [1]  70  10  20  45 -15

Recode values

Change the class

A data set generally contains variables with different class/characteristics, such as numeric, character, factor and logical. When analyzing the data, sometimes you need to convert the class of the variable.

# check whether the variable is certain class
is.numeric()
is.character()
is.factor()
# convert to the correct class for the analysis
as.numeric()
as.character()
as.factor()

Change the value

When doing analysis, we often need to recode some of the values. For example, in survey data, the gender might be coded as “female” or “male” and we want to recode it as 0 and 1. We can use the recode function from the package car.

First, install and load the package Car:

# Install the car package, you only need to instal once
install.packages("car")

# Load the car package
library(car)

Second, try the recode function!

Change all the 1 into “female” and all the 0 into “male” in the female variable in ANES data

table(ANES$female)

## 
##    0    1 
## 1987 2232

ANES$female <- recode(ANES$female, "1 = 'female';0 = 'male'")

table(ANES$female)

## 
## female male 
## 2232 1987

Change a range of data into certain value: recode the feeling thermometer scores less than 25 into 1 (meaning “very negative”), 25 to 49 into 2 (meaning “somewhat negative”), 50 to 74 into 3 (meaning “somewhat positive”), and 75 to 100 into 4 (meaning “very positive”)

table(ANES$Dem.Pres.cand.FT)

## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  22  23  24  25  26  28  29  30  31  32  33  35  36 
## 648  79  73  21  10  14   5   5   7   1  15   4   6   5   4 317  44  25  13   8  13   1   1   7   6   1   4   5 221  28  16   6   3   2 
##  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71 
##   5   3 202  23   3   2   5   5   1   3   4   2 181  16   5   3   3   7   2   1   2   8 272  25  12   2   3  16   4   1   3   7 300  31 
##  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##  10   8   6  19   2   7   1   2  13   4   3   5  14 340  32   7   2   1  28   1   2   6   6  19   3   9  14  20 260

ANES$Dem.Pres.cand.FT <- recode(ANES$Dem.Pres.cand.FT, "0:24=1; 25:49=2;50:74=3;75:100=4;else=NA")

table(ANES$Dem.Pres.cand.FT)

## 
##    1   2   3   4 
## 1326 551 928 820

After recoding the value, always double check the recoded variable

summary(ANES$Dem.Pres.cand.FT)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   2.343  3.000    4.000    646

Sometimes, you might only want to change one specific cell. For example, you find the value for a NA cell. You can recode that cell by locating its row.

# For example, I find that the first missing value in the feeling thermometer towards Clinton should be 39
ANES$Dem.Pres.cand.FT[which(is.na(ANES$Dem.Pres.cand.FT))[1]] <- 39

Export Data

Note that the changes you made in R and R environment do not change the data in your folder. Therefore, after recoding or adding variables, you will want to save the updated data in your computer, so you do not need to repeat the data manipulation. Using the same packages, we can save/export the data you created or worked in R to your computer as various formats. However, csv format is recommended.

Save data as csv file

# write the data to the current working directory
write.csv(ANES,"updatedANES.csv")
# write the data to other folders in your computer
write.csv(ANES,"path/updatedANES.csv")

Save data as other format

# Excel
write.xlsx(ANES,"updatedANES.xlsx")
# Stata
write.dta(ANES,"updatedANES.dta")

Data Structures in R

There are different data structures in R. In this document, I briefly explain individual types. If you need more details, you can find them in R’s documentation.

R’s base data structures can be thought by their dimensionality (1 dimension, 2 dimensions, or N dimensions) and whether the contents are of the same type (homogeneous) or of the different types (heterogeneous). This gives rise to the five data types most often used in data analysis:

Dimension	Homogeneous	Heterogeneous
1d	Vector	List
2d	Matrix	Data frame
Nd	Array

Scalars and types of variables

Types of variables

Note that R has a scalar type as well. A scalar data structure is the most basic data type that holds only a single atomic value at a time. Scalars don’t have to be numeric(often called double), they can also be different types such as characters (i.e. strings), integers, or logical values. We can check the type of a variable by using the typeof() function:

typeof(1)

## [1] "double"

typeof("politics")

## [1] "character"

typeof(TRUE)

## [1] "logical"

Note that having quotation marks around a number will give you a character variable, instead of a numeric variable. For example,

typeof("1")

## [1] "character"

Factors

Asides from types, there is also a property called Attributes. Attributes can be thought of as a named list (with unique names), and can be accessed individually with attr() or all at once (as a list) with attributes().

One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class, “factor”, which makes them behave differently from regular integer vectors, and the levels, which defines the set of allowed values.

x <- factor(c("a", "b", "b", "a"))
x

## [1] a b b a
## Levels: a b

class(x)

## [1] "factor"

levels(x)

## [1] "a" "b"

Coercion

We can change the type of a variable to type x using the function as.x. This process is called “coercion”. For example, the following code changes the number 65 to the string “65”:

as.character(65)

## [1] "65"

typeof(65)

## [1] "double"

typeof(as.character(65))

## [1] "character"

Similarly, you can coerce one type to another by using as.character(), as.double(), as.integer(), or as.logical().

Vectors

The basic data structure in R is the vector, a 1-dimensional array whose entries are the same type.

Creation

The following code produces a vector containing the numbers 1, 3, 5, 7, and 9:

vec <- c(1,3,5,7,9)
vec

## [1] 1 3 5 7 9

We don’t have to type out all the numbers. The following code assigns a vector of the numbers from 1 to 100 to vec:

vec <- 1:100
vec

##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

What if I only want even numbers from 1 to 100 (inclusive)? We can manipulate vectors using arithmetic operations (just like numbers). Note that arithmetic operations happen element-wise.

even <- 1:50 * 2
even

##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72
## [37]  74  76  78  80  82  84  86  88  90  92  94  96  98 100

Or we can use seq() function:

even <- seq(2,100,2) # seq(start number, end number, by)
even

##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72
## [37]  74  76  78  80  82  84  86  88  90  92  94  96  98 100

We can also use c() function to combine (“concatenate”) several small vectors into one large vector.

z <- 1:5
z <- c(z,3,z)
z

##  [1] 1 2 3 4 5 3 1 2 3 4 5

Checking

We can check if a variable is of type vector, using is.vector() or is.atomic(). Other types of variables can be also checked using is.character(), is.double(), is.integer(), and is.logical().

is.vector(vec)

## [1] TRUE

is.atomic(vec)

## [1] TRUE

Use the length() function to figure out how many elements there are in a vector.

odd <-seq(1,99,2)

Extracting elements

We can get multiple elements of a vector as well. The following code extracts the 5th to 9th even number (inclusive), and assigns it to the variable y:

y <- even[5:9]
y

## [1] 10 12 14 16 18

This extracts just the 3rd and 5th even numbers:

even[c(3,5)]

## [1]  6 10

We can also erase certain numbers using negative indexing. Let’s say I want all even numbers except the first two:

even[-c(1,2)]

##  [1]   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [37]  78  80  82  84  86  88  90  92  94  96  98 100

Matrices and Arrays

Matrices are just the 2-dimensional analogs of vectors while arrays are the n-dimensional analogs of vectors. As with vectors, elements of matrices and arrays have to be of the same type. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of.

Creation

Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim():

a <- matrix(1:6, ncol = 3, nrow = 2)
a

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

b <- array(1:12, c(2, 3, 2))
b

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

c <- 1:6
dim(c) <- c(2, 3)
c

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Notice that R takes the elements in the vector you give it and fills in the matrix column by column. If we want the elements to be filled in by row instead, we have to put in a byrow = TRUE argument:

A <- matrix(1:6, nrow = 2, byrow=TRUE)
A

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Checking

To get the dimensions of the matrix, we can use the dim(), nrow() and ncol() functions.

length(a)

## [1] 6

nrow(a)

## [1] 2

ncol(a)

## [1] 3

You can change the names of row and column as well.

rownames(a)<-c("A","B")
colnames(a)<-c("a","b","c")
a

##   a b c
## A 1 3 5
## B 2 4 6

Extracting elements

To access the element in the ith row and jcolumn for the matrix A, use the index i,j:

A[1,2]

## [1] 2

A[1,]

## [1] 1 2 3

A[,1]

## [1] 1 4

Lists

In all the data structures so far, the elements have to be of the same type.

Creation

To have elements on different types in one data structure, we can use a list, which we create with list(). We can think of a list as a collection of key-value pairs. Keys should be strings.

event <- list(year = "2021", month = "Aug")
event

## $year
## [1] "2021"
## 
## $month
## [1] "Aug"

The str() function can be used to inspect what is inside person:

str(event)

## List of 2
##  $ year : chr "2021"
##  $ month: chr "Aug"

To access the year element person, we have 2 options:

event[["year"]]

## [1] "2021"

# or
event$year

## [1] "2021"

Checking

The elements of a list can be anything, even another data structure! Let’s add the Saturdays in August:

event$saturday <- c(7,4,21,28)
str(event)

## List of 3
##  $ year    : chr "2021"
##  $ month   : chr "Aug"
##  $ saturday: num [1:4] 7 4 21 28

To see the keys associated with a list, use the names() function:

names(event)

## [1] "year"     "month"    "saturday"

Data frames

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.

You can subset a data frame like a 1d structure (where it behaves like a list), or a 2d structure (where it behaves like a matrix). We will talk about subsetting later when we cover how to manipulate data in R.

Creation

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)

## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

Checking

To check if an object is a data frame, use class() or test explicitly with is.data.frame():

class(df)

## [1] "data.frame"

is.data.frame(df)

## [1] TRUE

You can check the names of rows and columns.

colnames(df)

## [1] "x" "y"

rownames(df)

## [1] "1" "2" "3"

You can also check the numbers of rows and columns.

ncol(df)

## [1] 2

nrow(df)

## [1] 3

Coercion

You can coerce an object to a data frame with as.data.frame():

A vector will create a one-column data frame.
A list will create one column for each element; it’s an error if they’re not all the same length.
A matrix will create a data frame with the same number of columns and rows as the matrix.

vec<-c(1:5)
vec<-as.data.frame(vec)
str(vec)

## 'data.frame':    5 obs. of  1 variable:
##  $ vec: int  1 2 3 4 5

list<-list(1:2, 1:3, 1:4)
list<-as.data.frame(list)

## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 2, 3, 4

matrix<-a
matrix<-as.data.frame(matrix)
str(matrix)

## 'data.frame':    2 obs. of  3 variables:
##  $ a: int  1 2
##  $ b: int  3 4
##  $ c: int  5 6

Getting started with R

To get started with R, you need to install them first. This document will show you how to download R as well as RStudio, a software application that makes R easier to use. Both R and RStudio are free and easy to download. After installation, we will talk about some basic features of RStudio.

Downlaod and Install R and RStudio

How to download and install R

R is maintained by an international team of developers who make the language available through the web page of The Comprehensive R Archive Network. To download R, please do the followings:

Open an internet browser and go to https://cran.r-project.org/
Follow the link that describes your operating system: Windows, Mac, or Linux.
- Windows: Click “Download R for Windows” > “base” > “Download R 4.1.1 for Windows”
- Mac: Click “Download R for Mac” > “R-4.1.1.pkg”
The link downloads an installer program, which installs the most up-to-date version of R for Windows. Run this program and step through the installation wizard that appears. The wizard will install R into your program files folders and place a shortcut in your Start menu. Note that you will need to have all of the appropriate administration privileges to install new software on your machine. If your computer requires a password before installing new programs, you will need it here. The installer lets you customize your installation, but the defaults will be suitable for most users.

How to download and install Rstudio

RStudio is an application that helps you write in R, making it easier to use R. You can download RStudio for free at https://www.rstudio.com/products/rstudio/download/#download. Just click the “Download RStudio” button and follow the simple instructions that follow. Once you have installed RStudio, you can open it like any other program on your computer—usually by clicking an icon on your desktop.

Even if you use RStudio, you will still need to download R to your computer. RStudio helps you use the version of R that lives on your computer, but it does not come with a version of R on its own.

Open and Use RStudio

Structure of RStudio

To open RStudio, you can double-click on the Rstudio icon in the Applications. A window titled “Rstudio” should pop up, looking something like this:

Source: Wikimedia Commons

1. Top-left: This space presents scripts you are working with. To add a new script, click the icon in the top-left corner of the window, and click “R Script”. You can type in commands here – it does not run the commands unless you execute code. To excute code from this window, highlight the code by dragging your mouse and click icon at the top of the window. You can also use a short-cut using your keyboard: press Cmd+Enter on a Mac, Ctrl+Enter on Windows.

2. Top-right: This is project windows showing Environment and History.

Environment: List of objects that we have created or have access to. We can also see the list of objects using the command ls().
History: List of commands that we have entered into the console.

3. Bottom-left: This is the R console, where your commands are run in an interactive fashion. Type in your command and hit the Enter key. Once you hit the Enter key, R executes your command and prints the result, if any.

4. Bottom-right: This space is where you see files, plots, packages, and help files.

Files: Allows you to navigate the directory structure on your computer.
Plots: Any graphical output you make will be displayed here.
Packages: List of packages that have been installed on your computer. To use a package, click the square box on the very left. To unload a package, click once again such that the check mark disappears.
Help: Documentation for functionName appears here when you type ?functionName in the console.
Viewer: For displaying local web content

R as a calculator

You can use R has a calculator. For example,

50+2

## [1] 52

305*13

## [1] 3965

408/2

## [1] 204

There are several other math functions which come with R. For example, to evaluate \(log(e^{12}-2^{cos(\pi)})\), we can do the following.

log(exp(12) - 2^(cos(pi)))

## [1] 12

Variable assignment

There are times we want to store the result of a computation so that we can use it later. R allows us to do this by variable assignment. When assigning variable names, they must start with a letter and can only contain letters, numbers, _ and .. Also, the names are case sensitive. Additionally, you can’t use any of the words like TRUE, NULL, if, and function. If you try to use those, you will get an error message. It is possible that you can override this rule by using backticks such as `_abc`, but it’s better to avoid such complicated name because it will get you confused.

Here is an example code assigning 5 to the variable x:

x<-5

You can use = sign to assign values to variable, but I would advise using <- to avoid unnecessary confusion. Also, when you assign values, no output will be printed. This is simply because the act of assigning value to a variable does not produce any output. If we want to check whether we assigned the value correctly, we simply type the variable’s name into the console:

## [1] 5

We can also reassign x to a different value:

x <- (x+3)^2
x

## [1] 64

We can also add other variables and define their relationships.

y <- x+6
z <- y^2 + x

Note that now we have x, y, z entries in our Environment tab. To remove object/variable, use the rm() function:

rm(x)

To remove more than one object, separate them by commas:

rm(y,z)

To remove all objects at once, type in rm(list = ls()) in the console.

PLSC 498: Analytical Political Economy

This page provides R workshop materials for the class. The goal of the workshop is to introduce R to undergraduate students and to familiarize students with R’s tools for data analysis.

Getting started with R: Installation and basics of R
Data Structures in R
Data Manipulation in R
Merging Data sets in R
Do your Analysis