River Flow Forecasting Using Support Vector Machines

Over the past few months a colleague (Brian Wallace) and I have been working on a paper on river flow forecasting. A draft version is available at River Flow Paper.

The goal of our work was to beat the forecasting methods the Department of Water Resources currently uses for the April through July American River flow. The Department combines human judgement with linear regression equations to generate its forecasts. Given how simple those methods are, the forecasts are surprisingly hard to beat!

We spent a few months trying different machine learning methods with little success. Many of the methods we tried produced forecasts that were significantly worse than the current ones. A few, such as a properly trained neural network, gave forecasts comparable to the current ones. Finally, I decided to try a Support Vector Machine (SVM). After I tested a large number of parameter combinations, the forecasts became significantly better than the current ones.
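Testing a large number of parameter combinations usually means a grid search scored by cross-validation. The sketch below shows only the shape of that loop and is not the code from the paper. To stay dependency-free it swaps the SVM for a small kernel-weighted stand-in model, and the parameter names (gamma, shrink), the helper functions, and the toy data are all made up for illustration.

```python
import itertools
import math


def kernel_predict(train_x, train_y, x, gamma, shrink):
    """Stand-in model, NOT an SVM: RBF-weighted average shrunk toward the mean."""
    mean_y = sum(train_y) / len(train_y)
    weights = [math.exp(-gamma * (xi - x) ** 2) for xi in train_x]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed; fall back to the mean
        return mean_y
    local = sum(w * y for w, y in zip(weights, train_y)) / total
    return shrink * mean_y + (1.0 - shrink) * local


def leave_one_out_mae(xs, ys, **params):
    """Mean absolute error when each point is predicted from all the others."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        prediction = kernel_predict(train_x, train_y, xs[i], **params)
        errors.append(abs(prediction - ys[i]))
    return sum(errors) / len(errors)


def grid_search(xs, ys, grid):
    """Score every parameter combination; return (best_params, best_score, all_scores)."""
    names = sorted(grid)
    scores = {}
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores[values] = leave_one_out_mae(xs, ys, **params)
    best = min(scores, key=scores.get)
    return dict(zip(names, best)), scores[best], scores


# Toy data, made up for illustration only (not the river flow data).
toy_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
toy_y = [1.0, 1.8, 3.1, 3.9, 5.2, 5.8, 7.1, 8.0]

# Hypothetical grid: 4 x 3 = 12 combinations.
param_grid = {"gamma": [0.01, 0.1, 1.0, 10.0], "shrink": [0.0, 0.25, 0.5]}
best_params, best_score, all_scores = grid_search(toy_x, toy_y, param_grid)
print(best_params, round(best_score, 3))
```

With only a handful of years of data, leave-one-out scoring is a natural choice because every point gets used for both training and validation. A real SVM would take the place of kernel_predict, with its own parameters in the grid.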

The data we used to generate the forecasts is available online at https://github.com/bjwbell/California-Water-Runoff-Forecasting. The takeaway is that we reduced the relative forecast error from about 64% to about 49%. The table below shows the forecasts for 2001 through 2010.

SVM Forecasts 2001-2010

Year    Actual (acre-feet)    Predicted (acre-feet)    |Error| (acre-feet)
2001               552,626                  689,472                136,846
2002               973,817                1,028,681                 54,864
2003             1,354,434                  459,476                894,957
2004               632,159                  713,440                 81,281
2005             2,003,878                1,844,360                159,517
2006             2,622,387                2,315,193                307,193
2007               522,651                  293,256                229,394
2008               674,287                  800,080                125,793
2009             1,068,327                1,253,523                185,196
2010             1,486,780                1,023,649                463,130
Mean             1,189,135                1,042,113                263,817

Root mean squared error:      355,856 acre-feet
Relative absolute error:      48.65%
Root relative squared error:  54.14%

The forecasts currently used by the Department of Water Resources had a relative absolute error of 63.82% and a root relative squared error of 69.15%. Using modern SVM methods cut the relative absolute error by more than 15 percentage points (from 63.82% to 48.65%)! This is a fantastic result and shows the large payoff in keeping up with the state of the art, even for something as ordinary as river flow forecasting.
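The post does not spell out how the summary metrics are defined. The sketch below assumes the usual definitions: the relative measures divide the model's error by the error of a baseline that always predicts the mean of the actual values over the same ten years. The variable names are illustrative, and the actual and predicted values are copied from the SVM forecast table. Under that assumption the results should land close to the reported figures, with small differences from rounding.

```python
import math

# (year, actual, predicted) in acre-feet, copied from the SVM forecast table.
forecasts = [
    (2001, 552_626, 689_472),
    (2002, 973_817, 1_028_681),
    (2003, 1_354_434, 459_476),
    (2004, 632_159, 713_440),
    (2005, 2_003_878, 1_844_360),
    (2006, 2_622_387, 2_315_193),
    (2007, 522_651, 293_256),
    (2008, 674_287, 800_080),
    (2009, 1_068_327, 1_253_523),
    (2010, 1_486_780, 1_023_649),
]

actual = [a for _, a, _ in forecasts]
predicted = [p for _, _, p in forecasts]
n = len(forecasts)
mean_actual = sum(actual) / n

abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
sq_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]

# Absolute measures, in acre-feet.
mae = sum(abs_errors) / n
rmse = math.sqrt(sum(sq_errors) / n)

# Relative measures: model error divided by the error of the
# "always predict the mean" baseline (assumed definition).
rae = sum(abs_errors) / sum(abs(a - mean_actual) for a in actual)
rrse = math.sqrt(sum(sq_errors) / sum((a - mean_actual) ** 2 for a in actual))

print(f"Mean absolute error:         {mae:,.0f} acre-feet")
print(f"Root mean squared error:     {rmse:,.0f} acre-feet")
print(f"Relative absolute error:     {rae:.2%}")
print(f"Root relative squared error: {rrse:.2%}")
```

A relative error of 100% would mean the forecast is no better than predicting the long-run mean every year, which is why values near 50% are a meaningful improvement.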


Book Review: The Art of R Programming

My former professor, Norm Matloff, wrote “The Art of R Programming”, and No Starch Press was kind enough to send me a review copy.

The Art of R Programming is a straightforward explanation of R for programmers who are reasonably familiar with another language. Matloff assumes no expertise in C or algorithms, and his explanations are succinct and easy to follow.

If you aren’t familiar with R, it is a statistical programming language with some similarities to Matlab.

Rating 9/10

The big advantages of R are that (1) it’s high level, (2) it’s reasonably easy to read, (3) it’s functional in nature, and (4) it has a simple syntax. If you’re familiar with Python, R has a similar feel. Compared to complex languages such as C++ or Java, R is a breath of fresh air because its syntax is so light. That said, as a programming language, Python is nicer. R has a few annoyances (for me at least) that make it less pleasant to write in than Python.
A couple of those are:

  • Non-standard assignment operator, e.g. to assign 5 to x in R we write “x <- 5” instead of the “x = 5” used in most other languages. This is annoying because a significant amount of programming is assignment, and a two-character operator is twice as much typing. Contrast this with Python, which uses the plain “x = 5”. (R also accepts “x = 5”, but “<-” is the convention in most R code.)
  • Vector creation using “c(1,2,3,4)”. Vectors in R are similar to lists in Python, so it would be more natural to add a little syntactic sugar and use “[1,2,3,4]” for vector creation, i.e. the same syntax as Python and many other languages.
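For readers who have not seen the two notations side by side, here is a small Python sketch of the two points above, with the R spellings quoted in comments. The variable names are illustrative only.

```python
# R: x <- 5
x = 5

# R: v <- c(1, 2, 3, 4)
v = [1, 2, 3, 4]

# One difference worth knowing: arithmetic on an R vector is element-wise
# (c(1, 2, 3, 4) * 2 gives 2 4 6 8), whereas a plain Python list needs a
# comprehension to get the same result.
doubled = [2 * item for item in v]

print(x, v, doubled)
```

So the similarity between R vectors and Python lists is mostly about how they are written, not how they behave under arithmetic.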

The real reason to use R is its statistical libraries. It is very widely used for statistics and, for statistical work, is the most pleasant environment to work in.

The areas Matloff covers are:

1. Why R?
2. Getting Started
3. Vectors
4. Matrices
5. Lists
6. Data Frames
7. Factors and Tables
8. R Programming Structures
9. R Functions
10. Doing Math in R
11. Input/Output
12. Object-Oriented Programming
13. Graphics
14. Debugging
15. Writing Fast R Code
16. Interfacing R to Other Languages
17. Parallel R
18. String Manipulation
19. Installation: R Base, New Packages
20. User Interfaces
21. To Learn More

Much of the material is available online in tutorials such as John Cook’s “R Language for Programmers”. The real gems are the chapters “Writing Fast R Code”, “Interfacing R to Other Languages”, and “Parallel R”. These chapters contain great information that is not easy to find elsewhere.

“The Art of R Programming” is a fun read, albeit a somewhat specialized one. If you need to do statistical work as a programmer, I highly recommend buying it and spending an afternoon browsing through it.