Initial steps for your Data Analysis

Before using R for data analysis, it helps to know a few key points that prevent many common error messages. In this post, I will explain some of those key points.

Preparing Data for the Analysis

Variable names

  1. Since R is a case-sensitive language, the variables ‘Age’ and ‘age’ are treated as different variables in R. Therefore, you should pick one convention and use it for all your variable names. I usually write variable names in all lowercase letters. Alternatively, you can capitalize the first letter and keep the rest in lowercase. Avoid long names.
    Example: To specify the variable height, you can use either height or Height as the variable name.

  2. If a variable name has two or more words, insert a . or _ between the words instead of a space, or join the words without a space.
    Example: To write the male height as a variable name you can use
    male.height, male_height or MaleHeight

  3. Do not include a comma or symbols such as ?, $, %, ^, &, *, (, ), -, #, <, >, /, |, \, [, ], {, and }
    in variable names.
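
The three naming rules above can be checked in a few lines of R. This is only a minimal sketch: the column names in raw_names and the values in df are made up for illustration, and make.names() is one base R helper for cleaning names, not the only way to do it.

```r
# R is case sensitive: these are two separate objects.
Age <- 30
age <- 25
identical(Age, age)   # FALSE

# Hypothetical raw column names, e.g. copied from spreadsheet headers.
raw_names <- c("Male Height", "Age", "weight(kg)", "income$")

# make.names() replaces each invalid character with a dot;
# tolower() then applies one consistent case.
clean_names <- tolower(make.names(raw_names, unique = TRUE))
print(clean_names)

# Apply the cleaned names to a (made-up) one-row data frame.
df <- data.frame(170, 30, 65, 1000)
names(df) <- clean_names
```

After this, df$male.height works without quoting, while a name containing a space or $ would need backticks every time you use it.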

Check your data

  1. If there are any missing values in your data set, indicate them as NA.

  2. If you use any specific R packages for the data analysis, check the examples given in the help files of that package. For example, if you want to use the MASS package, run the code help(package="MASS"). To understand one of its functions, such as rlm (robust linear model), run help(rlm, package="MASS"). Note that the lm (linear model) function belongs to the stats package that is loaded with R, not to MASS, so help(lm) is enough for it.
    Many help pages include an example that illustrates how the function works. For example, to execute the examples for the lm (linear model) function, run the command example(lm).

  3. One of the most important points is to understand the data format required by the relevant package. Note that the functions in that package work only with this data format. Therefore, if your data set is not in that format, you have to reshape your data into the required format.

  4. You may also need a particular data structure to use the relevant package.

  5. If you want to create a new variable, or recode or rename variables,
    refer to Quick R.

  6. If you want to sort, subset or merge your data, refer to the following links: sort, subset, merge

  7. To reshape your data, read the following two tutorials on the STHDA website:
    (i) Tidyr R package
    (ii) Tibble R package
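
Several of the checks above can be tried with base R alone. The following is a rough sketch on a made-up data frame (people, ages and all column names are hypothetical); it shows missing values coded as NA, a quick look at the data structure, and the create, rename, sort, subset and merge steps from points 5 and 6.

```r
# Hypothetical survey data; NA marks the missing values.
people <- data.frame(
  id     = c(1, 2, 3, 4),
  height = c(170, NA, 158, 181),
  gender = c("M", "F", "F", NA)
)

str(people)                  # structure: column types and first values
colSums(is.na(people))       # number of missing values per column
complete <- na.omit(people)  # keep only rows without any NA

# Create a new variable, then rename a column.
people$height_m <- people$height / 100
names(people)[names(people) == "gender"] <- "sex"

# Sort by height (rows with NA go last by default), subset, and merge.
sorted <- people[order(people$height), ]
tall   <- subset(people, height > 165)
ages   <- data.frame(id = c(1, 2, 3), age = c(34, 28, 45))
merged <- merge(people, ages, by = "id")
```

Note that merge() keeps only the ids found in both data frames unless you ask for all = TRUE, so merged has one row fewer than people here.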

Importing Data for the analysis

The next step is importing data to your R session. Before importing data, check whether the working directory of your R session is the directory that holds your data.

  1. Run the code getwd() to check your working directory, and if it is not the directory that holds your data set, change it with setwd("<path to your dataset>"). For example, if your data set is in the D:\Rworks directory, use setwd("D:/Rworks").

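  2. To choose a data file interactively from your computer, run data <- read.delim(file.choose()) for tab delimited txt files or data <- read.csv(file.choose()) for comma delimited (csv) files. Both functions are part of base R, so no package has to be loaded; library("readr") is needed only if you prefer its counterparts read_tsv() and read_csv().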

  3. You can also import files by specifying paths. For example, to import a data file in a directory D:\Rworks use
    (i) data <- read.csv("D:/Rworks/data.csv", header=TRUE, sep=",", row.names="id")
    for a comma delimited (csv) file.
    (ii) data <- read.delim("D:/Rworks/data.txt", header = TRUE, sep = "\t", dec = ".")
    for a TAB delimited file.
    (iii) data <- read.table("D:/Rworks/data.txt", header = FALSE, sep = "", dec = ".")
    for tabular data separated by white space.

  4. To read a txt file hosted on a website, use
    data<- read.table("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/test.txt",header = FALSE)
    You can also use read.delim(), and read.csv() as well for the relevant data formats.

  5. To read xlsx files, use the readxl package as below:
    library("readxl")
    data <- read_excel("data.xlsx")

  6. To read the first worksheet from the workbook named dataexcel.xlsx, use
    library(xlsx)
    data <- read.xlsx("c:/dataexcel.xlsx", 1)
    To read another sheet, for example one named "sheet4", use data <- read.xlsx("D:/Rworks/dataexcel.xlsx", sheetName = "sheet4").
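
If you want to try the base R import functions without touching your own files, the sketch below first writes two tiny example files to a temporary location and then reads them back. The file contents and the object names (csv_file, txt_file, data_csv and so on) are made up for illustration; with real data you would pass your own path instead.

```r
# Create two small example files in a temporary location
# (hypothetical contents, standing in for your own data files).
csv_file <- tempfile(fileext = ".csv")
txt_file <- tempfile(fileext = ".txt")
writeLines(c("id,height,gender", "1,170,M", "2,NA,F", "3,158,F"), csv_file)
writeLines(c("id\theight\tgender", "1\t170\tM", "2\tNA\tF", "3\t158\tF"), txt_file)

# Comma delimited file; the id column becomes the row names.
data_csv <- read.csv(csv_file, header = TRUE, row.names = "id")

# TAB delimited file, read in two ways that give the same result here.
data_tab   <- read.delim(txt_file, header = TRUE)
data_table <- read.table(txt_file, header = TRUE, sep = "\t")

# Check what was imported: column types and the NA in height.
str(data_csv)
```

Running str() right after every import is a cheap habit: it shows immediately whether the header was picked up and whether numbers were read as numbers.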

Exporting Data in R

  1. To write data to a txt file having tab separated values, use write.table(data, file = "data.txt", sep = "\t", row.names = TRUE, col.names = NA).

  2. To write data to a comma delimited (csv) file, use
    write.csv(data, file = "data.csv")

  3. To write data to a new xlsx workbook, use
    library("xlsx")
    write.xlsx(data1, file = "dataworkbook1.xlsx", sheetName = "firstsheet", append = FALSE)
    Then add a second worksheet to the same workbook using write.xlsx(data2, file = "dataworkbook1.xlsx", sheetName = "secondsheet", append = TRUE).
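
A quick way to confirm that an export worked is to read the file back and compare it with the original. The sketch below does this for the txt and csv cases with base R only. The data frame and the paths are hypothetical, and unlike the commands above I set row.names = FALSE, which is my assumption about what you usually want when the row names carry no information.

```r
# Hypothetical data frame to export.
data <- data.frame(id = 1:3, height = c(170.5, NA, 158.2))

out_dir  <- tempdir()                       # scratch directory for this sketch
txt_path <- file.path(out_dir, "data.txt")
csv_path <- file.path(out_dir, "data.csv")

# Tab separated text file without row names.
write.table(data, file = txt_path, sep = "\t", row.names = FALSE)

# Comma delimited file; row.names = FALSE avoids an extra unnamed column.
write.csv(data, file = csv_path, row.names = FALSE)

# Read both files back and compare them with the original.
from_txt <- read.delim(txt_path)
from_csv <- read.csv(csv_path)
all.equal(data, from_txt)
all.equal(data, from_csv)
```

If you keep the default row.names = TRUE in write.csv(), the file gets an extra first column, and you need row.names = 1 in read.csv() to get the same data frame back.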

References
1. Quick R
2. STHDA