Improving Prediction of Office Room Occupancy Through Random Sampling
In many cases, you may think that you have a Big Data problem when in reality you just have a lot of data, and a simple random sample can yield great accuracy. In today’s blog, I decided to use the office room occupancy dataset provided by “Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Veronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39.” The dataset has 6 independent variables (predictors): date with timestamp; temperature of the room in Celsius; relative humidity in percent; light in lux; CO2 in ppm; and humidity ratio, a quantity derived from temperature and relative humidity. The occupancy is a categorical variable with 2 levels: 0 for not occupied and 1 for occupied. Occupancy was measured every minute from February 11, 2015 to February 18, 2015, and the dataset has 9,752 observations. The question I want to investigate is: can a small random sample produce performance as good as a large sample? For the model, I will build a Deep Feed Forward (DFF) learning model.
The occupancy dataset has a one-minute interval timestamp for each of the 7 recorded days. I decided to remove it. While time may allow us to predict if a room is occupied, it is a “flawed” variable. The company may decide to have a party on a weekend or after hours on a day not in our list, or a holiday may fall on a business day. Furthermore, from the following table, one can see that the daily occupancy frequencies don’t match except for 2/14/15 and 2/15/15, which represent the weekend. Hence, the date is ignored during the occupancy modeling.
From the following table, one may come to the conclusion that we have an unbalanced class problem: we have more records of non-occupancy than of occupancy. In fact, as one explores the matrix scatterplot, it becomes apparent that we are not dealing with unbalanced classes; we simply have a larger sample for non-occupancy than for occupancy. In this example, in a 24-hour period, most people spend less time at work than outside work. Still, this table shows that 2/14/15, 2/15/15, and 2/18/15 are not suitable as training sets, given that the model is a NN model.
From the matrix scatterplot, especially the box plots comparing the predictors’ values between non-occupancy and occupancy, we can see that, except for humidity and humidity ratio, the predictors (temperature, light, and CO2) differ depending on whether the room is occupied. From the last row of the matrix scatterplot, the distributions of temperature, light, and CO2 also appear to differ depending on whether the room is occupied, confirming that these variables should be kept. While not shown here, I did perform the Wilcoxon Rank Sum Test to confirm the differences between the medians. Hence, I could remove humidity ratio, and even humidity. However, I decided to remove only humidity ratio. Furthermore, one notices a lot of outliers (the black dots). If we assume 100% accuracy of measurement, outliers imply special cases, not errors, and they should not be discarded just to improve the model performance. Outliers, in this instance, also suggest that a non-linear or non-parametric machine learning model will perform better than a linear model.
None of the distributions are normal, but light appears to be extremely skewed, and values beyond 1,000 imply occupancy. On the other hand, extreme CO2 or humidity ratio values imply the room is not occupied. While these observations are interesting, they are irrelevant when one wants the model to infer. Scaling and centering the variables is suggested in deep learning (it helps training converge). In this case, I want to see if a good model can be produced even when the variables are not scaled or centered.
I have now decided on the model, and I have also decided not to scale or center the variables.
Designing the Model
In addition to building a Random Forest and a K Nearest Neighbor model, I will also use a deep feed-forward Neural Network.
The network I created has 5 layers: 4 hidden layers and an output layer. The first 2 layers have 10 nodes each, the third layer has 8 nodes, the fourth layer has 6 nodes, and the last (output) layer has 2 nodes. In all layers except the last, I used ReLU (rectified linear unit) as the activation function. As I’m building a classifier, the last layer uses the softmax function to calculate the probabilities of occupancy and non-occupancy. The model chooses the outcome with the highest probability.
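As a rough illustration of the architecture described above, here is a minimal Python sketch of a forward pass. It is not the Mathematica code behind the results. It assumes 4 inputs (temperature, humidity, light, and CO2, since date and humidity ratio were removed), and the random weights, the helper names, and the example reading are all made up for illustration.

```python
import math
import random

# Layer widths: 4 inputs (assumed: temperature, humidity, light, CO2),
# then the layers from the text: 10, 10, 8, 6, and a 2-node softmax output.
LAYER_SIZES = [4, 10, 10, 8, 6, 2]

def init_params(sizes, seed=0):
    """Small random weights and zero biases (illustrative initialization only)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def forward(params, x):
    """Return [P(not occupied), P(occupied)] for one observation x."""
    h = list(x)
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * a for w, a in zip(row, h)) + b for row, b in zip(weights, biases)]
        h = softmax(z) if i == last else relu(z)  # ReLU everywhere except the output
    return h

def predict(params, x):
    """Pick the class with the highest probability."""
    probs = forward(params, x)
    return max(range(len(probs)), key=probs.__getitem__)

params = init_params(LAYER_SIZES)
# Made-up, unscaled reading: temperature, humidity, light, CO2
probs = forward(params, [21.5, 27.0, 430.0, 700.0])
label = predict(params, [21.5, 27.0, 430.0, 700.0])
```

With untrained weights the output is meaningless; the point is only the shape of the computation: four ReLU layers followed by a softmax over the two occupancy classes.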
The model uses the cross-entropy loss function and is trained with Stochastic Gradient Descent, with an initial learning rate of 0.001, a momentum of 0.93, and a polynomial learning rate schedule.
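To make the training settings concrete, the following Python sketch shows one common form of SGD with momentum and a polynomial learning rate decay. The exact schedule used in the original Mathematica training is not given in the post, so the decay formula, its `power`, the step count, and the toy objective below are assumptions for illustration only.

```python
def polynomial_lr(initial_lr, step, total_steps, power=2.0):
    """Learning rate decaying polynomially from initial_lr to 0 (assumed form)."""
    frac = min(step, total_steps) / total_steps
    return initial_lr * (1.0 - frac) ** power

def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.93):
    """One classical momentum update: v <- m*v - lr*g, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Toy usage: minimize f(w) = w^2, whose gradient is 2w (not the occupancy model).
weights, velocity = [1.0], [0.0]
total_steps = 2000
for step in range(total_steps):
    lr = polynomial_lr(0.001, step, total_steps)  # initial rate from the text
    grads = [2.0 * w for w in weights]
    weights, velocity = sgd_momentum_step(weights, grads, velocity, lr)
```

The momentum term lets the small learning rate of 0.001 make steady progress, while the decaying schedule shrinks the steps as training nears its end.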
I will also build a Random Forest with 50 trees, and a K Nearest Neighbor model with K equal to 10.
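For readers unfamiliar with K Nearest Neighbor, here is a minimal Python sketch of the idea with K = 10: classify a new observation by majority vote among its 10 closest training observations. It assumes Euclidean distance on unscaled values, and the toy data is made up; it is not the Mathematica implementation used for the results, and the Random Forest is not sketched here.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=10):
    """Majority vote among the k training rows closest to x (Euclidean distance)."""
    distances = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    # On a tie, the label that appears first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]

# Made-up toy rows of [temperature, light, CO2]; 0 = not occupied, 1 = occupied.
train_X = [[20.0 + 0.1 * i, 0.0, 440.0 + i] for i in range(12)] + [
    [22.0 + 0.1 * i, 450.0 + i, 900.0 + i] for i in range(12)
]
train_y = [0] * 12 + [1] * 12

occupied_guess = knn_predict(train_X, train_y, [22.3, 460.0, 950.0])
empty_guess = knn_predict(train_X, train_y, [20.2, 5.0, 450.0])
```

Because the variables are not scaled, the predictors with the largest numeric range (light and CO2) dominate the distance, which is worth keeping in mind when reading the KNN results.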
Training the Model
In this problem, do we train on one day, test on another, and validate on yet another? Does it matter if one chooses non-sequential dates for the testing and training sets? For this last question, it will not matter: I have already decided that I will not consider time. For the first question, I will try it, but with the test data provided on GitHub by the authors of the research paper. Understanding the context always helps speed up decisions. This looks promising, but from the previous table, 2/14/15, 2/15/15, and 2/18/15 have no or very few occupancy records and cannot be used for training or testing the model. Hence, I’m left with only 5 potential days. Unfortunately, this approach is different from the approach I will present next, which makes the two unsuitable for direct comparison.
While the number of recorded non-occupancy observations is much larger than the number of occupancy observations, one can see that, in spite of outliers, one can distinguish between occupancy and non-occupancy by looking at temperature, light, and CO2. Hence, the training and testing sets can be built so that an equal number of occupancy and non-occupancy observations are sampled. Now, the big question is the size of the training and testing sets. As the predictors are independent and all numerical, I should need at least 120 observations (30 for each of the 4 predictors) for each level, meaning 120 for occupancy and 120 for non-occupancy. I have a large dataset; hence, I have decided to sample 200 for each level.
I draw 400 observations (200 per level) for each of the training and testing sets from the large one-week dataset. The two samples share no observations, which is very important to avoid bias. The validation set also consists of observations that are in neither the training nor the testing set.
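The sampling scheme above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Mathematica code used for the results: the function name, the seed, and the toy label counts are made up, and the real labels would come from the occupancy column.

```python
import random

def balanced_disjoint_samples(labels, per_class=200, n_sets=2, seed=0):
    """Return n_sets lists of row indices, each holding per_class rows per label,
    with no row shared between sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    sets = [[] for _ in range(n_sets)]
    for label, indices in by_class.items():
        needed = per_class * n_sets
        if len(indices) < needed:
            raise ValueError(f"class {label!r} has only {len(indices)} rows")
        # Sample without replacement once, then split, so the sets cannot overlap.
        chosen = rng.sample(indices, needed)
        for s in range(n_sets):
            sets[s].extend(chosen[s * per_class:(s + 1) * per_class])
    return sets

# Toy labels standing in for the occupancy column (made-up proportions).
labels = [0] * 7000 + [1] * 2000
train_idx, test_idx = balanced_disjoint_samples(labels)

# Everything not drawn remains available for validation.
used = set(train_idx) | set(test_idx)
validation_idx = [i for i in range(len(labels)) if i not in used]
```

Drawing all the needed rows for a class in one call and then slicing is a simple way to guarantee that the training and testing sets are disjoint.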
Performance with Sampling for 2/11/15 to 2/18/15
Deep Feed Forward
Our performance is very good, with an accuracy of 98.44%. The model is even able to predict the 8 minutes the room was occupied on 2/18/15.
Using this model on datatest.txt, which contains 2/02/15 to 2/04/15, its performance was still good at 97.94%. As one can see, the model was able to predict occupancy with 100% accuracy.
From this sample problem, one can see that having a lot of data doesn’t mean you need a large sample to model it accurately.
Date as a Sample
The rows represent the training set date, and the columns represent the testing set date. While they all performed well, using February 17 as the testing set and February 13 or February 12 as the training set gave the highest accuracy of 97.94%. Still, the performance was not consistent across all days. The worst performing model, with an accuracy of 87.27%, was the one using February 12 as the training set and February 13 as the testing set. This implies that choosing the most convenient sampling approach may not always provide the best accuracy.
Random Forest and K Nearest Neighbors
K Nearest Neighbors with K = 10 Accuracy 97.82%
Random Forest with 50 trees Accuracy of 97.86%
These results are consistent with the DFF’s accuracy of 97.94%.
These results are very impressive, knowing that the accuracy in the research paper was only 85%.
Discussion and Conclusion
Many assume they need a lot of data to build a model or make a prediction, that more is always better. Here, one can see that with a sample of less than 10% of the data and little extra work (no need to figure out which day gives the best performance), all the models have an accuracy of over 97%. These results are impressive given that with a training set size of 8,143, the researchers were able to create a model with an accuracy of only 85%. The performance of the models is directly related to understanding the problem and the data at hand. While it appeared the dataset had unbalanced classes, the imbalance could be considered inconsequential. While the researchers kept timestamps to the minute, time was inconsequential. Pre-analysis of the data also helped detect unnecessary variables like humidity ratio. Furthermore, independence of the predictors allowed us to estimate the training and test set sizes. In short, understanding the dataset will help speed up model building and, most importantly, improve model prediction with little work. All the code is written in Mathematica. I will upload it to GitHub in the next few days.