What can R do about a messy data format

  • A+
Category:Languages

Sometimes I see data posted in a SO question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.
I will post the dataset example here just in case the question is deleted.

+------------+------+------+----------+--------------------------+ |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days | +------------+------+------+----------+--------------------------+ | 2018-06-01 | A    | A1   |        0 |                        0 | | 2018-06-03 | A    | A2   |        0 |                        1 | | 2018-06-03 | A    | A3   |        0 |                        2 | | 2018-06-03 | A    | A4   |        1 |                        1 | | 2018-06-03 | A    | A5   |        2 |                        1 | | 2018-06-04 | A    | A6   |        0 |                        3 | | 2018-06-01 | B    | B1   |        0 |                        1 | | 2018-06-02 | B    | B2   |        0 |                        2 | | 2018-06-03 | B    | B3   |        0 |                        3 | +------------+------+------+----------+--------------------------+ 

As you can see this is not the right way to post data. As a user wrote in a comment,

It must've taken a bit of time to format the data the way you're showing it here. Unfortunately this is not a good format for us to copy & paste.

I believe this says it all. The asker is well intended and it took some work and time to try to be nice but the result is not good.

Question: What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

 


The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.

The first step after copying & pasting the table into an R session is to get rid of the horizontal lines and read it in with read.table setting the header, sep, comment.char and strip.white arguments.

Credits for reminding me of arguments comment.char and strip.white go to @nicola, and his comment.

dat <- read.table(text = " +------------+------+------+----------+--------------------------+ |    Date    | Emp1 | Case | Priority | PriorityCountinLast7days | +------------+------+------+----------+--------------------------+ | 2018-06-01 | A    | A1   |        0 |                        0 | | 2018-06-03 | A    | A2   |        0 |                        1 | | 2018-06-03 | A    | A3   |        0 |                        2 | | 2018-06-03 | A    | A4   |        1 |                        1 | | 2018-06-03 | A    | A5   |        2 |                        1 | | 2018-06-04 | A    | A6   |        0 |                        3 | | 2018-06-01 | B    | B1   |        0 |                        1 | | 2018-06-02 | B    | B2   |        0 |                        2 | | 2018-06-03 | B    | B3   |        0 |                        3 | +------------+------+------+----------+--------------------------+ ", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE) 

But as you can see there are some issues with the result.

dat    X       Date Emp1 Case Priority PriorityCountinLast7days X.1 1 NA 2018-06-01    A   A1        0                        0  NA 2 NA 2018-06-03    A   A2        0                        1  NA 3 NA 2018-06-03    A   A3        0                        2  NA 4 NA 2018-06-03    A   A4        1                        1  NA 5 NA 2018-06-03    A   A5        2                        1  NA 6 NA 2018-06-04    A   A6        0                        3  NA 7 NA 2018-06-01    B   B1        0                        1  NA 8 NA 2018-06-02    B   B2        0                        2  NA 9 NA 2018-06-03    B   B3        0                        3  NA 

To have separators start and end each data row made R believe those separators mark extra columns, which is not what is meant by the original question's OP.

So the second step is to keep only the real columns. I will do this subsetting the columns by their numbers, easily done, they usually are the first and last columns.

dat <- dat[-c(1, ncol(dat))] dat           Date   Emp1   Case Priority PriorityCountinLast7days 1  2018-06-01   A      A1           0                        0 2  2018-06-03   A      A2           0                        1 3  2018-06-03   A      A3           0                        2 4  2018-06-03   A      A4           1                        1 5  2018-06-03   A      A5           2                        1 6  2018-06-04   A      A6           0                        3 7  2018-06-01   B      B1           0                        1 8  2018-06-02   B      B2           0                        2 9  2018-06-03   B      B3           0                        3 

That wasn't too hard, much better.
In this case there is still a problem, to coerce column Date to class Date.

dat$Date <- as.Date(dat$Date) 

And the result is satisfactory.

str(dat) 'data.frame':   9 obs. of  5 variables:  $ Date                    : Date, format: "2018-06-01" "2018-06-03" ...  $ Emp1                    : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2  $ Case                    : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9  $ Priority                : int  0 0 0 1 2 0 0 0 0  $ PriorityCountinLast7days: int  0 1 2 1 1 3 1 2 3 

Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.

The whole process took only 3 lines of base R code.

Finally, the end result in dput format, like it should be in the first place.

dat <- structure(list(Date = structure(c(17683, 17685, 17685, 17685,  17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A",  "A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2",  "A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0,  0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1,  1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame") 

Comment

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: