Over the last few weeks, I’ve been working with monthly rainfall and temperature data in an effort to predict drought in India at the district level. My initial question was: since climate change entails more extreme temperatures and more extreme weather events such as drought, can temperature data be used to predict drought one year ahead?
This data was challenging as it was time series data with multiple, imbalanced classes. In this post, I want to talk about some lessons learned from working with this difficult data.
Time series part 1: Data leakage
For time series data, it is critical to separate the train and test data by years, rather than randomly. When I first ran my model with a random 80/20 split, it did rather well. But then I realized I was cheating in a sense: I had brought the future into the past.
Why? Well, say district A’s 1950 data was in my test set while the 1950 data for districts B, C, and D was in my train set. My model was learning the rainfall patterns of 1950 from the training districts and applying them to the district in question.
To avoid this data leakage, I split my data on year, putting data before 1992 into my train set and data after 1992 into my test set.
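A minimal sketch of this year-based split, assuming a pandas DataFrame with a year column (the column names and toy values here are my own, not the original schema):

```python
import pandas as pd

# Toy frame standing in for the district-level data; the column
# names ("year", "rainfall") are illustrative.
df = pd.DataFrame({
    "year": [1950, 1960, 1985, 1995, 2000],
    "rainfall": [80.2, 65.1, 90.4, 55.0, 70.3],
})

# Split on the year itself rather than sampling rows at random,
# so no future observations leak into the training set.
train = df[df["year"] <= 1992]
test = df[df["year"] > 1992]
```

Unlike a random split, every row from a given year lands entirely on one side of the boundary.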
Time series part 2: Seasonality
Time series in general is tricky because you have to teach your model to utilize past data. It is important to remember that when you give the model an observation to predict, it is only working with one row of data — so that row needs to contain relevant past data. Essentially what this means is having features that shift the data by a given time period.
With only a few lines of python code, it’s easy to use decomposition methods to break rainfall down into a trend, seasonal pattern, and residual. The graph below shows a clear seasonal component.
Seasonal decomposition for rainfall in a district (Adilabad)
This clear seasonality was the reason why I included features such as rainfall last month, rainfall two months ago, rainfall in this month last year, rainfall in this month two years ago, etc. These are called auto-regressive terms as you are regressing the data on a version of itself.
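These lagged features are a few lines of pandas shift calls; the lag choices below mirror the ones mentioned above, and the toy single-column frame is my own (with per-district data you would combine groupby with shift so lags don’t cross district boundaries):

```python
import pandas as pd

# Illustrative monthly rainfall series for one district.
df = pd.DataFrame({"rainfall": range(30)})

# Auto-regressive features: rainfall 1 and 2 months ago, and
# rainfall in this month 1 and 2 years ago (lags 12 and 24).
for lag in (1, 2, 12, 24):
    df[f"rain_lag_{lag}"] = df["rainfall"].shift(lag)
```

The first rows of each lag column are NaN, since there is no earlier data to shift in; those rows are typically dropped before modeling.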
Rather than doing this manually, a grid search with seasonal ARIMA (SARIMA, found in statsmodels) can identify the optimal auto-regressive and differencing terms for fitting a model to seasonal time series data.
Multiple, imbalanced classes
Most of the models available in sklearn work with multi-class data. If you have n classes, the confusion matrix simply becomes n x n rather than the 2 x 2 of the binary case. I found sklearn’s classification report to be of great help in analyzing precision and recall for a multi-class model, and ROC curves can be drawn as long as pos_label is specified.
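Both tools are one-liners in sklearn; the toy 3-class problem below stands in for the drought classes (the class labels and model here are my own choices, not the original pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_curve
from sklearn.model_selection import train_test_split

# Toy 3-class problem standing in for the drought severity classes.
X, y = make_classification(n_samples=300, n_classes=3,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Per-class precision, recall, and F1 in one call.
print(classification_report(y_te, pred))

# ROC for one class vs. the rest: pass that class's predicted
# probabilities and name the class via pos_label.
probs = clf.predict_proba(X_te)[:, 2]
fpr, tpr, _ = roc_curve(y_te, probs, pos_label=2)
```

With pos_label set, roc_curve treats that class as positive and everything else as negative, giving one curve per class if you loop over them.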
One way to deal with imbalanced classes is to under-sample the majority data, non-drought data in my case. For example, I trained a model on a dataset that included equal numbers of drought and non-drought data to try to make my model more sensitive to drought.
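Under-sampling like this is straightforward with pandas; the frame below is a made-up 10/90 split between drought and non-drought rows:

```python
import pandas as pd

# Toy imbalanced frame: 1 = drought (minority), 0 = non-drought.
df = pd.DataFrame({"drought": [1] * 10 + [0] * 90,
                   "feature": range(100)})

minority = df[df["drought"] == 1]
majority = df[df["drought"] == 0]

# Under-sample the majority class down to the minority class size,
# then shuffle so the classes aren't grouped in order.
balanced = pd.concat([
    minority,
    majority.sample(n=len(minority), random_state=0),
]).sample(frac=1, random_state=0)
```

The resulting training set has equal drought and non-drought counts, at the cost of discarding most of the majority-class rows.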
SARIMA model predictions for rainfall in Adilabad
However, with classes so imbalanced, this dataset might have been better conceived of as an exercise in anomaly detection. Even a SARIMA model that accurately captured the rainfall trend still tended to predict values within the median range (see graph above). This is to be expected, as these are the values most likely to occur, but it also makes the model a poor predictor of deviations from the norm, i.e. drought.