Manipulate Data in R


In this document, you will learn how to prepare data using R. We start from setting a working directory, then learn how to install packages. We also look at how to import, view, manipulate, and Export data.

Before Loading Data

Set the Working Directory

Before loading data and start playing with it, you need to tell R where to find the data in your computer, i.e., the directory or the path. So before we go in further, let’s first see how can we set and change the working directory.

To set a directory, you use the setwd() function.

#view your current working directory
getwd()

#change the working directory to your project folder
setwd("/Users/username/RWork")
#view what is inside this working directory
dir()

**A note about directory/filenames in Windows:* In the Windows system, path names use backslashes, e.g. C:\Users\username\RWork\. R, on the other hand, inherits its file and folder naming conventions from Unix which uses forward slashes instead: C:/Users/username/RWork/.

Install packages

A package bundles together code, data, documentation, and tests, and is easy to share with others. In R’s CRAN (Comprehensive R Archive Network), there are huge variety of packages available for you to use. This is one of the reasons that R is so successful: the chances are that someone has already solved a problem that you’re working on, and you can benefit from their work by downloading their package.

To use those packages, you first need to install them. You only need to do this once. To install a package pkgName, simply type install.packages("pkgName") into the console.

install.packages("dplyr")

To use functions in the package, we first have to load the package: library(pkgName). If the package has not been installed yet, we will get an error. Otherwise, functions in the package are now available for use.

library(dplyr)

Import Data

Packages needed for different data formats

You can import all kinds of data format in R using various packages.

File Type Package Function
Excel (.xlsx) openslxs read.xlsx
Stata (.dta, versions 12 and earlier) foreign read.dta
Stata (.dta, versions 13 and 14) readstata13 read.dta13
SPSS (.sav) foreign read.spss
Comma separated (.csv) none read.csv
Tab delimited (.tab or .txt) none read.table
R data files (.Rdata) none load
#Install necessary packages, you only need to do this once
install.packages("openxlsx")
install.packages("foreign")
install.packages("readstata13")

Load packages and import data

Although you only need to install packages to R once, you need to load the required packages in R using the library() function before you analysis every time in the beginning of your R session.

1.Read Excel file

#You need to library() a package before using it in R. 
library(openxlsx) 

#If your data is in the current working directory, you can simply do:
data <- read.xlsx("filename.xlsx")

# If your data is not in the current working directory, you need to give R the path to the data, such as 
data <- read.xlsx("/Users/username/RWork/filename.xlsx") # do not run

2.Read a comma-seperated (.csv) file

# If your data is in the current working directory, you can simply do:
ANES <- read.csv("cleaned_ANES.csv")
# If your data is not in the current working directory, you need to give the path to the data, such as
ANES <- read.csv("/Users/username/RWork/cleaned_ANES.csv")

3.Read a R data (.Rdata) file

load("filename.Rdata")

4.Read a Stata (.dta, versions 12 and earlier) file

library(foreign)
data<-read.dta("filename.dta")

5.Read a Stata (.dta, versions 13 and 14) file

The “foreign” pack does not support stata data after version 12. We can use “readstat13” to import from the later versions of stata instead.

library(readstata13)
# Read from stata version 13&14
data<-read.dta13("filename.dta")

6.Read .sav (SPSS) file

library(foreign)
data <- read.spss("filename.sav", to.data.frame = TRUE)

7. Read a Tab deliminted (.tab or .txt) file

data <- read.table("JoP_R_data.txt",sep="\t")

Some useful arguments within read function

In some cases, you need to specify additional information of the data in the data importing function to be able to load the data in R in addition to the path and file name. You can get information about the arguments and their default in the R documents using the help() function.

#Practice and review what arguments are there inside different data import functions
?read.csv
?read.xlsx
?read.spss

Header

The header argument tells R whether to treat the first row of the data as the variable names. In the examples above, you probably have noticed that the default header argument in read.table is FALSE, while the default header argument in read.csv is TRUE. It means that by default, read.table() function will not treat the top row of the data as the variable name. If the first row of your data is the variable name, you’d want to change the default from FALSE to TRUE:

# If the first row of your data contains the variable names, you need to specify the header argument to TRUE in read.table. You do not need to specify the header to TRUE in read.csv, as it is the default in read.csv.
read.table(file, header = TRUE)
# If the first row of your csv data is not the variable names, you need to change the default of header in read.csv to FALSE
read.csv(file, header = FALSE)

sep

The sep argument tell R how to separate the values in the raw data. The default of read.table is sep=“”, which tells R to separate values on one or more white spaces. For example, under this default, “Penn State” will be treated as two values, and will be put into two columns in a data frame.

If your data is not white space separated, you need to change this default accordingly. In the “Penn State” example, the “Penn State” might be one variable, and should be put into one column instead of two. In this case, the data is not separated by white space, but tabs. If your data is tab separated, you need to change the sep argument to \(“\t”\):

read.table(file, sep="\t")

If your data is comma separated, the sep argument should be “,” which is the default in read.csv().

to.data.frame in read.spss

When you read a SPSS data into R, the read.spss() function does not change the format of the data as a data frame. However, as most of our analyses are on a data frame, you’d want to change this default when you read a SPSS file:

read.spss(file, to.data.frame = TRUE)

Get existing data

There are multiple ways you can get data into R. First, there are data included in packages. Second, you can find datasets online and introduce a resource with a lot of links to political datasets. Third, you can load data from GitHub repositories.

Using data from data packages

A lot of packages include dataset. For example, the package poliscidata provide General Social Survey, American National Election Survey, and States dataset.

install.packages("poliscidata")
library(poliscidata)

Using specific packages to load data

Some data are directly downloadable to R using specific packages. For example, to get data from the General Social Survey, you can use the gssr package (Kieran healy, 2019). This package is not on CRAN, so we need to use the devtools package to install it and then use library() to load it. The easiest way to get the GSS data is to simply ask to get all of the data in an object called gss_all with data(). This loads 6108 variables contained in the cumulative GSS data.

devtools::install_github("kjhealy/gssr")
library("gssr")
data(gss_all)

There are several other functions that might be help conditional upon what you would like to study. Accordingly, we recommend that you consult https://kjhealy.github.io/gssr/ for various examples on how to access the data and the documentation of different variables.

Similarly, you can obtain the American National Election Study data, using the package anesr(Martherus, 2019). This package is also not available on CRAN for now, so we need to use install_github() function from the devtools pakcage. After that, we can lod the package.

devtools::install_github("jamesmartherus/anesr")
library("anesr")

To see all the datasets available via the package, simply type the following:

data(package="anesr")

If you run the above code, you will see that there is a dataset called timeseries_2020. To access to this data, use the data() function. This returns an object called timeseries_2020 with 8280 observations and 1381 variables. For further information, see https://github.com/jamesmartherus/anesr.

data(timeseries_2020)

As you may see from the data set, you there are too many variables that may not be of your interests. Also, the names of variables are not informative at all – they are marked as a combination of “V” and some numbers. So before using any publicly available data, I encourage you to check the documentations for data sets and use only variable they you need.

Load data from GitHub repositories

A lot of data today is available in GitHub repositories. To get data from GitHub, we can first set the file path (click on ‘Raw’ to get to the raw dataset) and the use read.csv() function.

urlfile<-'https://raw.githubusercontent.com/erikgahner/polls/master/polls.csv'
polls<-read.csv(url(urlfile))

View data

In addition to the View() function, there are many other useful functions for getting information on your data.

View some basic features

# list all the objects in your current working environment
ls()
# Load dataset from GitHub
urlfile<-'https://raw.githubusercontent.com/boyoon0715/R_Workshops/main/cleaned_ANES.csv'
ANES<-read.csv(url(urlfile))
# list the variable names
names(ANES)
##  [1] "caseid"              "weight_full"         "mode"                "voted2012"           "regist2016"          "primary2016"        
##  [7] "intend.vote"         "ideology"            "pid_strength"        "born_again"          "fundamentalist"      "race"               
## [13] "female"              "income"              "hannity"             "minutes60"           "cooper"              "matthews"           
## [19] "press"               "conan"               "big.bang"            "modern.family"       "rush"                "NPR"                
## [25] "voted2016"           "vote.choice2016"     "Dem.Pres.cand.FT"    "GOP.Pres.cand.FT"    "Lib.Pres.cand.FT"    "Green.Pres.cand.FT" 
## [31] "Dem.House.cand.FT"   "GOP.House.cand.FT"   "Ind.House.cand.FT"   "Dem.Sen.cand.FT"     "GOP.Sen.cand.FT"     "Ind.Sen.cand.FT"    
## [37] "Sr.Sen.FT"           "Jr.Sen.FT"           "Nonrun.Sen.FT"       "Dem.VP.cand.FT"      "GOP.VP.cand.FT"      "Roberts.FT"         
## [43] "Pope.FT"             "Christian.Fund.FT"   "Fem.FT"              "Liberals.FT"         "Labor.Unions.FT"     "Poor.FT"            
## [49] "Big.Business.FT"     "Conservatives.FT"    "SCt.FT"              "Gays.FT"             "Congress.FT"         "Rich.FT"            
## [55] "Muslims.FT"          "Christians.FT"       "Jews.FT"             "Tea.Party.FT"        "Police.FT"           "Trans.FT"           
## [61] "Scientists.FT"       "BLM.FT"              "tax.mill"            "vaccinces.in.school" "vaccines"            "Govt.Reduce.Eq"     
## [67] "Req.Equal"           "how.equal"           "Foreign.Im"          "immig.support"       "Free.Trade.Ag"       "marijuana"          
## [73] "Asians.FT"           "Hispanics.FT"        "Blacks.FT"           "Illegals.FT"         "Whites.FT"           "location"           
## [79] "educ"                "GOP"                 "Dem"                 "employed"            "cath_attend"         "prot_attend"        
## [85] "jew_attend"          "cath_nonatt"         "prot_nonatt"         "jew_nonatt"          "Obama12"             "Romney12"           
## [91] "yes.vote2016"        "Trump2016"           "Clinton2016"
# dimension of your data
dim(ANES)
## [1] 4271   93
# view the first 2 rows in the data
head(ANES,n=2)
# view the last 2 rows in the data
tail(ANES,n=2)
# class of an object
class(ANES)
## [1] "data.frame"
class(ANES$NPR)
## [1] "integer"
# view summary statistics of a variable
summary(ANES$NPR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1452  0.0000  1.0000    1999

Subset data

# hand pick rows and columns of a data
ANES[c(1,2),c(2,3)] # first two rows and second and third column
# exclude certain rows and columns
ANES[-1, -c(1,2,3)] # exclude the first row and first three columns 
# select more than one variables
ANES[c("NPR","rush","hannity")]
# select based on the values of observations
ANES[which(ANES$NPR==1),]
# select based on multiple critieria
ANES[which(ANES$Dem.Pres.cand.FT >=20 & ANES$GOP.Pres.cand.FT >= 20),] # select the respondents whose feeling thermometer scores towards Clinton AND Trump are BOTH bigger or equal to 20 
ANES[which(ANES$Dem.Pres.cand.FT >=20 | ANES$GOP.Pres.cand.FT >= 20),] # select the respondents whose feeling thermometer score towards Clinton OR Trump is bigger or equal to 20 
# subset use the subset function
subset(ANES, ANES$Dem.Pres.cand.FT >=20 | ANES$GOP.Pres.cand.FT >= 20) 

Sort data

You can sort your data on one or more variables, and you can set the order to be ascending or descending

# the order function returns a permutation which rearranges its first argument into ascending or descending order. The default is ascending. 
order(ANES$Dem.Pres.cand.FT)
# change into descending order using the decreasing argument inside order()
order(ANES$Dem.Pres.cand.FT, decreasing = TRUE)
# Sort the turnout data based on the ascending order of ANES 
ordered.ANES <- ANES[order(ANES$Dem.Pres.cand.FT,decreasing = TRUE),]
# Note after ordering, the NAs will always be at the end no matter whether the order is ascending or descending

Manipulate data

The data sets you have been using here (ANES) have already been cleaned. They are in the correct formats and contains everything you need for the class analysis. However, in reality when you use other people’s data or download data online, you might need to use several variables in the data to calculate the variable you are interested in, and you might need to correct the class of a variable, and you might need to recode the values in the data.

Create new variables

Some times, you need to use multiple variables in a date set to create a variable of your interest. Instead of calculating the value every time when you need to use the variable, it is more convenient to create such variable first and add the new variable to the data set.

# Add a new variable Diff.FT to the ANES data frame
ANES$Diff.FT <- ANES$GOP.Pres.cand.FT - ANES$Dem.Pres.cand.FT
head(ANES$Diff.FT,5)
## [1]  70  10  20  45 -15

Recode values

Change the class

A data set generally contains variables with different class/characteristics, such as numeric, character, factor and logical. When analyzing the data, sometimes you need to convert the class of the variable.

# check whether the variable is certain class
is.numeric()
is.character()
is.factor()
# convert to the correct class for the analysis
as.numeric()
as.character()
as.factor()

Change the value

When doing analysis, we often need to recode some of the values. For example, in survey data, the gender might be coded as “female” or “male” and we want to recode it as 0 and 1. We can use the recode function from the package car.

First, install and load the package Car:

# Install the car package, you only need to instal once
install.packages("car")

# Load the car package
library(car)

Second, try the recode function!

  • Change all the 1 into “female” and all the 0 into “male” in the female variable in ANES data
table(ANES$female)
## 
##    0    1 
## 1987 2232
ANES$female <- recode(ANES$female, "1 = 'female';0 = 'male'")
table(ANES$female)
## 
## female male 
## 2232 1987 
  • Change a range of data into certain value: recode the feeling thermometer scores less than 25 into 1 (meaning “very negative”), 25 to 49 into 2 (meaning “somewhat negative”), 50 to 74 into 3 (meaning “somewhat positive”), and 75 to 100 into 4 (meaning “very positive”)
table(ANES$Dem.Pres.cand.FT)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  22  23  24  25  26  28  29  30  31  32  33  35  36 
## 648  79  73  21  10  14   5   5   7   1  15   4   6   5   4 317  44  25  13   8  13   1   1   7   6   1   4   5 221  28  16   6   3   2 
##  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71 
##   5   3 202  23   3   2   5   5   1   3   4   2 181  16   5   3   3   7   2   1   2   8 272  25  12   2   3  16   4   1   3   7 300  31 
##  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##  10   8   6  19   2   7   1   2  13   4   3   5  14 340  32   7   2   1  28   1   2   6   6  19   3   9  14  20 260
ANES$Dem.Pres.cand.FT <- recode(ANES$Dem.Pres.cand.FT, "0:24=1; 25:49=2;50:74=3;75:100=4;else=NA")
table(ANES$Dem.Pres.cand.FT)
## 
##    1   2   3   4 
## 1326 551 928 820 
  • After recoding the value, always double check the recoded variable
summary(ANES$Dem.Pres.cand.FT)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   2.000   2.343  3.000    4.000    646 

Sometimes, you might only want to change one specific cell. For example, you find the value for a NA cell. You can recode that cell by locating its row.

# For example, I find that the first missing value in the feeling thermometer towards Clinton should be 39
ANES$Dem.Pres.cand.FT[which(is.na(ANES$Dem.Pres.cand.FT))[1]] <- 39

Export Data

Note that the changes you made in R and R environment do not change the data in your folder. Therefore, after recoding or adding variables, you will want to save the updated data in your computer, so you do not need to repeat the data manipulation. Using the same packages, we can save/export the data you created or worked in R to your computer as various formats. However, csv format is recommended.

Save data as csv file

# write the data to the current working directory
write.csv(ANES,"updatedANES.csv")
# write the data to other folders in your computer
write.csv(ANES,"path/updatedANES.csv")

Save data as other format

# Excel
write.xlsx(ANES,"updatedANES.xlsx")
# Stata
write.dta(ANES,"updatedANES.dta")