söndag 29 juni 2014

Approximating data in Gnuplot with the fit function and time as xdata.

Gnuplot is the shit. Normally, the biggest problem is finding out (i.e. Google-hunt for examples) how to do what you want and not to determine if it at all possible. However, I just ran into a problem where Gnuplot was buggy. I had a data-series with time-stamps (full dates) on the x-axis and my measurements (two different) on the y-axis and I wanted to approximate the trend with a simple least-square f(x) = l*x + m line. This was seemingly pretty easy to do:

#!/usr/bin/gnuplot -persist
set xdata time

set xlabel "Time"
set ylabel "Measurements" font "Arial,12"

set autoscale
set timefmt "%Y-%m-%d"

set term png xFFFFFF size 1200,700
set output "/tmp/measurements.png"
set datafile separator ";"

f(x) = a*x + b
fit f(x) './measurements.dat' using 1:2 via a, b
g(x) = c*x + d
fit g(x) './measurements.dat' using 1:3 via c, d
plot \
"./measurements.dat" using 1:3 title "Thingie A" with linespoints, \
"./measurements.dat" using 1:6 title "Thingie B" with linespoints, \
f(x) title "Least-squares f(x) = a*x + b approximation", \
g(x) title "Least-squares g(x) = c*x + d approximation"


For the first data-set ("Thingie A"), I got a believable approximation line - i.e., the line looked like it did approximate the general trend of the data-points of the set. However, where the first data-set had on average increasing data-points, the second data-set had generally decreasing points. Yet, the approximation line (g(x)) was increasing.  How could that be? Was my impression of a general decreasing trend an optical illusion? Or was Gnuplot's fit-function somehow buggy? It turned out to be the latter.

The fit-function uses "an implementation of the nonlinear least-squares
 (NLLS) Marquardt-Levenberg algorithm" (cited from the built-in help) and, apparently, that algorithm (or implementation) doesn't do well when the x and y values are very different in size. My y-points in the second data set was in a narrow range from circa 105 to 100 but what was the x values? 


Unlike Unix, Gnuplot doesn't use an Epoch starting with 00:00 January 1st 1970 but actually January 1st 2000:

gnuplot> plot strftime("%Y-%m-%d %H:%M", 0)
         warning: Skipping unreadable file "2000-01-01 00:00"
         No data in plot


Thus, since I have collected my data during this year, all my x-values where around  450.000.000 and, yes, that is quite a lot bigger than y-values around 100...

What to do? Well, the simplest thing to get the x-values closer to the y-values is to divide with the number of a seconds a day (86400) to have the fit-implementation work with x-values in the 5000 range instead.

To do this, I changed the following lines:

 fit g(x/86400) './measurements.dat' using 1:2 via c, d
...
plot \
...
f(g/86400) title "Least-squares g(x) = c*x + d approximation", \
... 

And that was enough, now the approximation line was declining just like my impression of the data-points. Success!