In the first part of this series of posts, I looked at some initial Seattle data from Inside Airbnb and used a simple decision tree model to develop a basic understanding of the data and the relationships within it. Today I am going to talk about four things: using an expanded version of the dataset, how learning changes as a function of the number of examples in the dataset, how a more complicated model can greatly increase the accuracy of our predictions in this particular case, and how variable importance changes when we move to a more robust model.

–

The initial dataset I used for this project was the “listings.csv” file under the Seattle header. I thought its contents were the same as the “listings.csv.gz” dataset, since they share the same name, but it turns out that the latter contains a lot of additional variables that do not exist in the former. This expanded dataset contains information such as the number of bedrooms and bathrooms and the 30/60/90-day availability. It does require significant additional parsing though, as many numerical variables use “,” as a thousands separator and price values contain “$” symbols. Below is the list of variables I ended up using for making predictions:

–

'neighbourhood', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'square_feet', 'price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365'
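The parsing mentioned above, stripping “$” symbols and thousands separators from numeric columns, can be sketched roughly as follows. This is a minimal illustration rather than the exact code used for the post; the helper name `parse_number` and the example strings are assumptions made up for this sketch.

```python
import math


def parse_number(raw):
    """Convert strings such as '$1,250.00' to a float.

    Hypothetical helper for illustration: blanks and None become NaN,
    values that are already numeric are passed through as floats.
    """
    if raw is None:
        return math.nan
    if isinstance(raw, (int, float)):
        return float(raw)
    # Drop the currency symbol and the thousands separator before converting
    cleaned = str(raw).strip().replace("$", "").replace(",", "")
    if cleaned == "":
        return math.nan
    return float(cleaned)


# Made-up examples in the style of the price-like columns
examples = ["$1,250.00", "$85.00", "1,000", "", None]
parsed = [parse_number(value) for value in examples]
print(parsed)
```

If the listings are loaded into a pandas DataFrame (here assumed to be called `df`), the same helper could be applied column by column, for example with `df["price"].map(parse_number)`.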

–

The overall data composition is the same though, so the probability of classifying an above-average value for “reviews_per_month” is still 60%. Using the decision tree model discussed in the previous post, I achieve an accuracy of 72% in 10-fold stratified cross-validation, an improvement that comes from the additional variables now included. However, I thought I might try a more complicated model to see how both variable importance and prediction accuracy change.
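For reference, 10-fold stratified cross-validation of a decision tree looks roughly like the sketch below. It runs on synthetic data from scikit-learn rather than the Airbnb listings, so the printed numbers will not match the 72% above. The tree depth, the random seeds and the 60/40 class split are assumptions made for this sketch, and the baseline is computed as the majority-class share, which is one way to read the 60% figure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the listings features, with a roughly 60/40 class split
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    weights=[0.6, 0.4],
    random_state=0,
)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# max_depth=4 is an arbitrary choice for this sketch, not the post's setting
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")

# Accuracy obtained by always predicting the most common class
baseline = max(np.mean(y), 1 - np.mean(y))

print(f"majority-class baseline: {baseline:.2f}")
print(f"mean CV accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

Comparing the mean cross-validated accuracy against the majority-class baseline is what tells us whether the model has learned anything beyond the class proportions.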

The first image above shows the learning curve for a LightGBM model trained on this same data. We clearly see that the cross-validation score improves continuously as we get more data, only reaching a plateau towards the largest sample sizes. The convergence of the training and testing curves, plus this tendency to learn more as the amount of data grows, suggests that the model is generalizing well, without significant over-fitting. This model is also better at making predictions, as it uses boosting to improve predictive ability, taking our accuracy from 72% to 79% in stratified cross-validation.

–

–

The importance of variables, shown in the second image in this post, also changes significantly when we go from a simple decision tree to LightGBM. The latitude and longitude suddenly become the most important variables (location, location, location), followed by the price of the rental. The cleaning fee, a variable that wasn’t present in the first dataset I looked at, also becomes important. However, it is worth noting that these relationships are still heavily non-linear, and changes in the variables can have unexpected effects on the probability of having higher-than-average reviews per month. The third image in this post shows an example of this, where lowering the price actually lowers the probability of having above-average reviews, while increasing the price increases this probability up to a point.

Given that a LightGBM model can be so successful as a classifier for “above average reviews per month”, *with an accuracy of almost 80%*, I wonder if we could actually build a successful regressor to tackle this problem. A regressor would be very useful, since we would be able to see the predicted average reviews per month as a function of any number of parameters we desire. With this in mind, we could potentially evaluate how optimal the pricing of an Airbnb is, whether we would obtain substantial gains from lowering the cleaning fee, and so on. We will look at building a regressor and evaluating its results in the third post in this series.
