Importing CSV and Excel data (.xls and .xlsx) into R using the readxl package

Read CSV files with read.csv and Excel files (.xls and .xlsx) with the ‘readxl’ package

Even today, many companies use Microsoft Excel to store their information. It is convenient, the data is laid out in a tabular fashion, and it needs little training because most employees already know Excel well.

Data from Excel can be saved either as an Excel workbook (.xls, .xlsx) or as a comma-separated values file (.csv).

Before importing data into R, it helps to set the working directory to the folder where the files are stored. This avoids typing the entire path each time a file is read. It can be done using the following commands:

#knowing current working directory
getwd()
#setting a new working directory
setwd("C:/...imaginary file path")

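Notice the use of forward slashes in the working directory path. R accepts forward slashes on every operating system, including Windows; a backslash has to be doubled (\\) because it is the escape character in R strings.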
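As a small sketch of the point about slashes, the three ways of writing a path are compared below. The folder C:/data and the file name are made up for illustration and do not need to exist, since nothing is read from disk.

```r
# Forward slashes work on every platform, including Windows
path_forward = "C:/data/mydata.csv"      # hypothetical path

# A backslash is an escape character in R strings, so it must be doubled
path_escaped = "C:\\data\\mydata.csv"    # same hypothetical path, Windows style

# file.path() joins the pieces with "/" so the separator is never typed by hand
path_built = file.path("C:", "data", "mydata.csv")

print(path_built)
print(identical(path_forward, path_built))
```

Building paths with file.path() is one way to keep a script portable between Windows, Mac and Linux.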

Importing csv data into R

CSV data can be read into a data frame in R using the following command:

mydata = read.csv('mydata.csv', header=TRUE)

Notice that because the working directory is already set, we need not provide the entire path to the file.
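To try read.csv without having a file at hand, the sketch below first writes a tiny CSV into R's temporary directory and then reads it back. The column names and values are invented for illustration only.

```r
# Create a small CSV in a temporary directory (illustrative data only)
csv_path = file.path(tempdir(), "mydata.csv")
writeLines(c("id,name,score",
             "1,Asha,82.5",
             "2,Ben,91.0",
             "3,Chen,77.25"), csv_path)

# header = TRUE treats the first line as column names;
# stringsAsFactors = FALSE keeps text columns as character vectors
mydata = read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

str(mydata)   # column types that R inferred
head(mydata)  # first few rows
```

Checking the result with str() right after import is a quick way to catch columns that were read with an unexpected type.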

Importing excel data into R
There are multiple packages for reading Excel data into R, such as ‘xlsx’, ‘gdata’ and ‘xlsReadWrite’. The one illustrated here is ‘readxl’.

#install package readxl
install.packages('readxl')
#load package
library(readxl)
#read excel file (with .xls or .xlsx extension) into r dataframe
mydata = read_excel('myfile.xlsx', sheet=1) #reading first sheet
#for xls file replace .xlsx with .xls above

The advantage of the ‘readxl’ package is that it has no external dependencies (such as Java or Perl), so it is easy to install and use on all operating systems.
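A few more readxl calls can be sketched using the sample workbook that ships with the package, so no file of your own is needed. This assumes readxl is already installed; the variable names are only illustrative.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl
xlsx_path = readxl_example("datasets.xlsx")

# List the sheet names before deciding which one to read
excel_sheets(xlsx_path)

# Read a sheet by name instead of by position
cars = read_excel(xlsx_path, sheet = "mtcars")

# Read only the first 5 data rows of the first sheet
first_rows = read_excel(xlsx_path, sheet = 1, n_max = 5)

dim(cars)
first_rows
```

Note that read_excel returns a tibble; wrapping the call in as.data.frame() gives a plain data frame if that is preferred.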

Splitting Data into Train and Test using caret package in R

Splitting data in R using sample function and caret package

Data is split into a train set, used to fit the model, and a test set, used to evaluate the model on rows it has not seen.

There are multiple ways of doing this.

1. Splitting data using sample function

#load data into variable called mydata
mydata = read.csv('mydata.csv', header=TRUE)
#setting seed so we get same data split each time
set.seed(100) #can provide any number for seed
nall = nrow(mydata) #total number of rows in data
ntrain = floor(0.7 * nall) #number of rows for train, 70%
ntest = nall - ntrain #number of rows for test, the remaining 30%
index = seq_len(nall) #row numbers 1..nall
trainIndex = sample(index, ntrain) #randomly chosen row numbers for train
testIndex = setdiff(index, trainIndex) #all remaining row numbers for test

train = mydata[trainIndex,]
test = mydata[testIndex,]
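The same idea can be wrapped in a small function and checked on R's built-in iris data. The function name split_data and its arguments are hypothetical, chosen only for this sketch.

```r
# Hypothetical helper: randomly split a data frame into train and test sets.
# Note that set.seed() inside the function also resets the global RNG state.
split_data = function(df, train_frac = 0.7, seed = 100) {
  set.seed(seed)
  index = seq_len(nrow(df))
  trainIndex = sample(index, floor(train_frac * nrow(df)))
  testIndex = setdiff(index, trainIndex)
  list(train = df[trainIndex, , drop = FALSE],
       test  = df[testIndex, , drop = FALSE])
}

parts = split_data(iris)

nrow(parts$train)   # rows used for training
nrow(parts$test)    # rows held out for testing

# Sanity check: no row should appear in both sets
overlap = intersect(rownames(parts$train), rownames(parts$test))
length(overlap)
```

Checking that the two sets do not overlap and together cover every row is a cheap guard against indexing mistakes.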

2. Splitting data using caret package

The caret package can split data based on the target variable (the y variable), so that the distribution of the target is similar in the train and test sets.

For illustration, I am assuming the target variable is a column named TARGET.

#install caret package
install.packages('caret')
#load package
library(caret)
trainIndex = createDataPartition(mydata$TARGET, 
                       p=0.7, list=FALSE,times=1)

train = mydata[trainIndex,]
test = mydata[-trainIndex,]
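To see what splitting on the target does, the sketch below repeats the caret split on the built-in iris data, using Species as a stand-in for TARGET, and compares class proportions. It assumes caret is installed.

```r
library(caret)

set.seed(100)  # so the split is reproducible

# Species plays the role of the target variable in this illustration
trainIndex = createDataPartition(iris$Species,
                                 p = 0.7, list = FALSE, times = 1)

train = iris[trainIndex, ]
test = iris[-trainIndex, ]

# Class proportions in the full data, the train set and the test set
prop.table(table(iris$Species))
prop.table(table(train$Species))
prop.table(table(test$Species))
```

If the three tables look alike, the split has preserved the balance of the target, which a plain random sample does not guarantee on small or imbalanced data.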

There are more ways of splitting data. If you want to read more,
please refer to the following link:
http://topepo.github.io/caret/splitting.html

Hello world in R

This post illustrates how to write a simple Hello World program in R.

To do this, download R from here and install it on your system.

R is available on Windows, Mac as well as Linux platforms.

Once you install R, open it and you should see the R console window.

Now go to the File menu and select New Script. In the new file, let us print the introductory Hello World by typing the following code:

 print("Hello World") 

Now save your code as Hello World.R 

Congrats, you have successfully saved an R file. Now it’s time to execute it.
Select the code in the file and hit Ctrl+R. The output, [1] "Hello World", appears in the R console behind the script window.

That’s it. You have created and executed your first R program :)
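As an alternative to selecting the code and pressing Ctrl+R, a saved script can also be run from the console with source(). The sketch below writes the one-line script to R's temporary directory first, so it runs as is; the location is an assumption made only for this example.

```r
# Write the one-line script to a temporary location (illustrative path)
script_path = file.path(tempdir(), "Hello World.R")
writeLines('print("Hello World")', script_path)

# Run every line of the saved script, as if it had been typed into the console
source(script_path)
```

From a terminal, the same file can be run without opening R by calling Rscript with the path to the script.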