After putting together an electronic device to monitor the temperature of my refrigerator, I discovered that it didn't work consistently. It would run, crash, restart… and each time it would start a new file. Still, I wanted to know if some basic things were working. Specifically, did each file start at 1 second, and how long the files run before crashing.
Although this isn't the most exciting example in the world, it does show some nice features of the data.table package. Especially, the rbindlist function.
rm(list = ls())
library(data.table)
## The following line is needed to compile to HTML, please ignore
if (basename(getwd()) != "data.table_demo") {
setwd("..")
}
## LIST FILES
csvfiles = list.files(path = "data/data logger results/", full.names = TRUE,
pattern = "[Cc][Ss][Vv]")
csvfiles
## [1] "data/data logger results/LOGGER01.CSV"
## [2] "data/data logger results/LOGGER02.CSV"
## [3] "data/data logger results/LOGGER03.CSV"
## [4] "data/data logger results/LOGGER04.CSV"
## [5] "data/data logger results/LOGGER05.CSV"
## [6] "data/data logger results/LOGGER06.CSV"
## [7] "data/data logger results/LOGGER07.CSV"
## [8] "data/data logger results/LOGGER08.CSV"
## [9] "data/data logger results/LOGGER09.CSV"
## [10] "data/data logger results/LOGGER10.CSV"
## [11] "data/data logger results/LOGGER11.CSV"
## [12] "data/data logger results/LOGGER12.CSV"
## [13] "data/data logger results/LOGGER13.CSV"
## [14] "data/data logger results/LOGGER14.CSV"
Notice that all the files have different sizes
## The following line is needed to compile to HTML, please ignore
if (basename(getwd()) != "data.table_demo") {
setwd("..")
}
dat = lapply(csvfiles, read.csv)
str(dat, 1)
## List of 14
## $ :'data.frame': 18 obs. of 6 variables:
## $ :'data.frame': 5393 obs. of 6 variables:
## $ :'data.frame': 7 obs. of 6 variables:
## $ :'data.frame': 3 obs. of 6 variables:
## $ :'data.frame': 17 obs. of 6 variables:
## $ :'data.frame': 12 obs. of 6 variables:
## $ :'data.frame': 201 obs. of 6 variables:
## $ :'data.frame': 76 obs. of 6 variables:
## $ :'data.frame': 36076 obs. of 6 variables:
## $ :'data.frame': 175 obs. of 6 variables:
## $ :'data.frame': 7119 obs. of 6 variables:
## $ :'data.frame': 2633 obs. of 6 variables:
## $ :'data.frame': 38196 obs. of 6 variables:
## $ :'data.frame': 50 obs. of 6 variables:
## ADD FILE NAME
for (i in 1:length(csvfiles)) {
dat[[i]]$file = basename(csvfiles[i])
}
str(dat, 1)
## List of 14
## $ :'data.frame': 18 obs. of 7 variables:
## $ :'data.frame': 5393 obs. of 7 variables:
## $ :'data.frame': 7 obs. of 7 variables:
## $ :'data.frame': 3 obs. of 7 variables:
## $ :'data.frame': 17 obs. of 7 variables:
## $ :'data.frame': 12 obs. of 7 variables:
## $ :'data.frame': 201 obs. of 7 variables:
## $ :'data.frame': 76 obs. of 7 variables:
## $ :'data.frame': 36076 obs. of 7 variables:
## $ :'data.frame': 175 obs. of 7 variables:
## $ :'data.frame': 7119 obs. of 7 variables:
## $ :'data.frame': 2633 obs. of 7 variables:
## $ :'data.frame': 38196 obs. of 7 variables:
## $ :'data.frame': 50 obs. of 7 variables:
rbindlistdatAll = rbindlist(dat)
Now just print the data object itself to see the head and tail:
datAll
## millis.1000 sens0 sens1 sens2 sens3 sens4 file
## 1: 1 0 0 153 116 75 LOGGER01.CSV
## 2: 2 0 0 151 118 71 LOGGER01.CSV
## 3: 3 0 0 153 119 75 LOGGER01.CSV
## 4: 4 0 0 153 117 72 LOGGER01.CSV
## 5: 5 0 0 152 117 72 LOGGER01.CSV
## ---
## 89972: 46 0 0 138 110 74 LOGGER14.CSV
## 89973: 47 0 0 138 110 74 LOGGER14.CSV
## 89974: 48 0 0 138 110 75 LOGGER14.CSV
## 89975: 49 0 0 138 110 75 LOGGER14.CSV
## 89976: 50 0 0 137 110 75 LOGGER14.CSV
data.table examplesmillis.1000” should be “seconds” setnames will fix that## Change to 'seconds'
setnames(datAll, "millis.1000", "seconds")
colnames(datAll)
## [1] "seconds" "sens0" "sens1" "sens2" "sens3" "sens4" "file"
## NOTE: The alternative expression still works (with a warning), if you
## prefer. But who wants to read that??
## colnames(datAll)[which(colnames(datAll)=='millis.1000')] = 'seconds'
First, let's look at some simple plots and tables to make sure that we know what we've got.
## Plot to see what the index values look like Seems like each file starts
## at one and marches through time linearly... which is what it should do.
indx = 1:nrow(datAll)
plot(seconds ~ indx, data = datAll, col = factor(file), main = "Time index by file")
## Seeing how many starts we had, when seconds==1
datAll[seconds == 1]
## seconds sens0 sens1 sens2 sens3 sens4 file
## 1: 1 0 0 153 116 75 LOGGER01.CSV
## 2: 1 0 0 154 114 72 LOGGER02.CSV
## 3: 1 0 0 149 111 73 LOGGER03.CSV
## 4: 1 0 0 149 111 73 LOGGER04.CSV
## 5: 1 0 0 149 111 73 LOGGER05.CSV
## 6: 1 0 0 146 119 73 LOGGER06.CSV
## 7: 1 0 0 146 118 74 LOGGER07.CSV
## 8: 1 0 0 149 115 71 LOGGER08.CSV
## 9: 1 0 0 141 109 74 LOGGER09.CSV
## 10: 1 0 0 142 109 73 LOGGER10.CSV
## 11: 1 0 4 141 116 74 LOGGER11.CSV
## 12: 1 0 0 149 112 73 LOGGER12.CSV
## 13: 1 0 0 153 119 68 LOGGER13.CSV
## 14: 1 0 0 138 109 74 LOGGER14.CSV
## This shows us that it was the first element that had 'seconds' == 1
## when we group the data by 'file'
datAll[, j = which(seconds == 1), by = file]
## file V1
## 1: LOGGER01.CSV 1
## 2: LOGGER02.CSV 1
## 3: LOGGER03.CSV 1
## 4: LOGGER04.CSV 1
## 5: LOGGER05.CSV 1
## 6: LOGGER06.CSV 1
## 7: LOGGER07.CSV 1
## 8: LOGGER08.CSV 1
## 9: LOGGER09.CSV 1
## 10: LOGGER10.CSV 1
## 11: LOGGER11.CSV 1
## 12: LOGGER12.CSV 1
## 13: LOGGER13.CSV 1
## 14: LOGGER14.CSV 1
## The .N is very useful!! Totals by file
datAll[, .N, by = file]
## file N
## 1: LOGGER01.CSV 18
## 2: LOGGER02.CSV 5393
## 3: LOGGER03.CSV 7
## 4: LOGGER04.CSV 3
## 5: LOGGER05.CSV 17
## 6: LOGGER06.CSV 12
## 7: LOGGER07.CSV 201
## 8: LOGGER08.CSV 76
## 9: LOGGER09.CSV 36076
## 10: LOGGER10.CSV 175
## 11: LOGGER11.CSV 7119
## 12: LOGGER12.CSV 2633
## 13: LOGGER13.CSV 38196
## 14: LOGGER14.CSV 50
The above code tells us how many items we had in each group of files, and that each file started with “1” in the seconds field. We could also ask what's the max and min of “seconds” in each file group to get the same answer, and confirm that the timer (probably) didn't skip any seconds… if the count by file == the max seconds by file I think it's safe to assume that the timer is at least working.
I'm going to do the query the old way, using data frames, then using data tables.
## Data frame approach 1
with(datAll, tapply(seconds, file, range))
## $LOGGER01.CSV
## [1] 1 18
##
## $LOGGER02.CSV
## [1] 1 5393
##
## $LOGGER03.CSV
## [1] 1 7
##
## $LOGGER04.CSV
## [1] 1 3
##
## $LOGGER05.CSV
## [1] 1 17
##
## $LOGGER06.CSV
## [1] 1 12
##
## $LOGGER07.CSV
## [1] 1 201
##
## $LOGGER08.CSV
## [1] 1 76
##
## $LOGGER09.CSV
## [1] 1 36076
##
## $LOGGER10.CSV
## [1] 1 175
##
## $LOGGER11.CSV
## [1] 1 7119
##
## $LOGGER12.CSV
## [1] 1 2633
##
## $LOGGER13.CSV
## [1] 1 38196
##
## $LOGGER14.CSV
## [1] 1 50
## Data frame approach 2 (cleaned up)
do.call(rbind, with(datAll, tapply(seconds, file, range)))
## [,1] [,2]
## LOGGER01.CSV 1 18
## LOGGER02.CSV 1 5393
## LOGGER03.CSV 1 7
## LOGGER04.CSV 1 3
## LOGGER05.CSV 1 17
## LOGGER06.CSV 1 12
## LOGGER07.CSV 1 201
## LOGGER08.CSV 1 76
## LOGGER09.CSV 1 36076
## LOGGER10.CSV 1 175
## LOGGER11.CSV 1 7119
## LOGGER12.CSV 1 2633
## LOGGER13.CSV 1 38196
## LOGGER14.CSV 1 50
## Data table approach 1 (very simple)
datAll[, range(seconds), by = file]
## file V1
## 1: LOGGER01.CSV 1
## 2: LOGGER01.CSV 18
## 3: LOGGER02.CSV 1
## 4: LOGGER02.CSV 5393
## 5: LOGGER03.CSV 1
## 6: LOGGER03.CSV 7
## 7: LOGGER04.CSV 1
## 8: LOGGER04.CSV 3
## 9: LOGGER05.CSV 1
## 10: LOGGER05.CSV 17
## 11: LOGGER06.CSV 1
## 12: LOGGER06.CSV 12
## 13: LOGGER07.CSV 1
## 14: LOGGER07.CSV 201
## 15: LOGGER08.CSV 1
## 16: LOGGER08.CSV 76
## 17: LOGGER09.CSV 1
## 18: LOGGER09.CSV 36076
## 19: LOGGER10.CSV 1
## 20: LOGGER10.CSV 175
## 21: LOGGER11.CSV 1
## 22: LOGGER11.CSV 7119
## 23: LOGGER12.CSV 1
## 24: LOGGER12.CSV 2633
## 25: LOGGER13.CSV 1
## 26: LOGGER13.CSV 38196
## 27: LOGGER14.CSV 1
## 28: LOGGER14.CSV 50
## file V1
## Data table approach 2
datAll[, list(min(seconds), max(seconds)), by = file]
## file V1 V2
## 1: LOGGER01.CSV 1 18
## 2: LOGGER02.CSV 1 5393
## 3: LOGGER03.CSV 1 7
## 4: LOGGER04.CSV 1 3
## 5: LOGGER05.CSV 1 17
## 6: LOGGER06.CSV 1 12
## 7: LOGGER07.CSV 1 201
## 8: LOGGER08.CSV 1 76
## 9: LOGGER09.CSV 1 36076
## 10: LOGGER10.CSV 1 175
## 11: LOGGER11.CSV 1 7119
## 12: LOGGER12.CSV 1 2633
## 13: LOGGER13.CSV 1 38196
## 14: LOGGER14.CSV 1 50
## Data table approach 3
datAll[i = TRUE, j = list(min = min(seconds), max = max(seconds)), by = file]
## file min max
## 1: LOGGER01.CSV 1 18
## 2: LOGGER02.CSV 1 5393
## 3: LOGGER03.CSV 1 7
## 4: LOGGER04.CSV 1 3
## 5: LOGGER05.CSV 1 17
## 6: LOGGER06.CSV 1 12
## 7: LOGGER07.CSV 1 201
## 8: LOGGER08.CSV 1 76
## 9: LOGGER09.CSV 1 36076
## 10: LOGGER10.CSV 1 175
## 11: LOGGER11.CSV 1 7119
## 12: LOGGER12.CSV 1 2633
## 13: LOGGER13.CSV 1 38196
## 14: LOGGER14.CSV 1 50
Notice that after cleaning up the code I ended up with something that seems a little verbose:
datAll[i = TRUE, j = list(min = min(seconds), max = max(seconds)), by = file]
To make a nice table the old way it takes several steps. I would argue that the data.table solution is much more readable than either of these (and it's definitely much faster):
## NOT EVALUATED (also has no column names)
do.call(rbind, with(datAll, tapply(seconds, file, range)))
## NOT EVALUATED (also has no column names)
aggregate(datAll$seconds, by = list(file = datAll$file), function(x) c(min(x),
max(x)))
## NOT EVALUATED
data.frame(file = aggregate(datAll$file, by = list(file = datAll$file), "[",
1)$x, min = aggregate(datAll$seconds, by = list(file = datAll$file), min)$x,
max = aggregate(datAll$seconds, by = list(file = datAll$file), max)$x, stringsAsFactors = FALSE)
Temperatures are converted to Fahrenheit, but the temperatures don't make much sense. The values for Sens3 are especially suspect since it's not a calibrated instrument, but the other two are. The freezer was in fact freezing, so the high temperatures don't make sense.
Also, I know that the pressure sensor was pressed more than once!!
plot(datAll$sens0, main = "pressure sensor tab (pressed randomly)")
plot(datAll$sens1, main = "light sensor (means fridge is open)")
plot(datAll$sens2/10 * 9/5 + 32, main = "fridge sensor")
plot(datAll$sens3/10 * 9/5 + 32, main = "external sensor (not a calibrated sensor)")
plot(datAll$sens4/10 * 9/5 + 32, main = "freezer sensor")
boxplot(sens2 ~ file, datAll)
boxplot(sens3 ~ file, datAll)
boxplot(sens4 ~ file, datAll)