
Transposing a dataframe
You will sometimes be given a format that contains data that is arranged vertically and you want to flip it so that the variables are arranged horizontally. You will also hear this referred to as long format versus wide format. Most predictive analytics packages are set up to use long format, but there are often cases in which you want to switch rows with columns. Perhaps data is being input as a set of key pairs and you want to be able to map them to features for an individual entity. Also, this may be necessary with some time series data in which the data which comes in as long format needs to be reformatted so that the time periods appear horizontally.
Here is a data frame that consists of sales for each member for each month in the first quarter. We will use the text=' option of the read.table() function to read table data that we have pasted directly into the code. For example, this is from data that has been pasted directly from an Excel spreadsheet:
sales_vertical <- read.table(header=TRUE, text='
memberid Month sales
1 1 17
1 2 15
1 3 11
2 1 6
2 2 20
2 3 11
3 1 9
3 2 33
3 3 43
4 1 11
4 3 13
4 4 12
')
After running the preceding code, the sales_vertical command displays the data in the console similar to how it was coded in the read.table() function:
>sales_vertical
memberid Month sales
1 1 1 17
2 1 2 15
3 1 3 11
4 2 1 6
5 2 2 20
6 2 3 11
7 3 1 9
8 3 2 33
9 3 3 43
10 4 1 11
11 4 3 13
12 4 4 12
For switching the rows with the columns, we can use the spread function from the tidyr package:
install.packages("tidyr")
library(tidyr)
sales_horizontal <- spread(sales_vertical, Month, sales)
sales_horizontal
The last line prints the results of the transpose to the console. Take a look at the output of the console to verify that what used to be rows (memberids) are now columns. Each of the sales figures also appears as columns, with the column name designating the specific month. Additionally, for each month of data in which a sales figure did not appear for a particular month, NA will appear. For example, in the original data, member 12 was the only member who had sales figures for month 4. That will appear as 12, in the month 4 column, but since the other members had no sales for that month, they will appear as NA:
>sales_horizontal <- spread(sales_vertical, Month, sales)
>sales_horizontal
memberid 1 2 3 4
1 1 17 15 11 NA
2 2 6 20 11 NA
3 3 9 33 43 NA
4 4 11 NA 13 12