Friday, January 25, 2013

Blacklisting

Things have gone a bit quiet of late, I realize. In part this is due to real life, which has a habit of getting in the way. But in large part it's because we have been grappling with the creation of a blacklist. 'We' here is the very definition of the royal we, as it would be fairer to say that Jared has been grappling with this issue.

There be gremlins in the data decks constituting some of the input data to the databank algorithm - both dubious data and dubious geolocation metadata. We knew this from the start but held off on blacklisting until we got the algorithm doing roughly what we thought it should and everyone was happy with it. We have now been attacking the problem for several weeks. Here are the four strands of attack:

1. Placing a running F-test through the merged series to find jumps in variance. This found a handful of intra-source cases of craziness. We will delete these stations through blacklisting.
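To give a flavour of what a running variance check looks like, here is a minimal sketch. It is an illustration only, not the code actually used: the function names, the window length and the threshold are all assumptions made for the example. It computes the variance ratio (the F statistic) between the windows either side of each point; a proper F-test would compare that ratio with a critical value from the F distribution rather than with a fixed number.

```python
from statistics import variance


def running_variance_ratio(series, window=24):
    """Return (index, ratio) pairs comparing the sample variance of the
    `window` values before each index with the `window` values from it on.

    `window` is a hypothetical choice (e.g. 24 monthly values).
    """
    ratios = []
    for i in range(window, len(series) - window + 1):
        before = variance(series[i - window:i])
        after = variance(series[i:i + window])
        if before == 0 or after == 0:
            continue  # ratio is undefined for a constant segment
        # Larger variance on top, so the ratio is always >= 1.
        ratios.append((i, max(before, after) / min(before, after)))
    return ratios


def flag_variance_jumps(series, window=24, threshold=10.0):
    """Keep only the points whose ratio exceeds `threshold`.

    The threshold is an assumed stand-in for an F critical value.
    """
    return [(i, r) for i, r in running_variance_ratio(series, window)
            if r > threshold]


# Toy series: 48 quiet values followed by 48 noisy ones.
quiet = [0.1 if k % 2 == 0 else -0.1 for k in range(48)]
noisy = [5.0 if k % 2 == 0 else -5.0 for k in range(48)]
jumps = flag_variance_jumps(quiet + noisy)
```

In the toy series the largest ratio sits at the point where the quiet segment ends and the noisy one begins, which is the kind of within-series change this strand is after.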

2. Running through NCDC's pairwise homogenization algorithm to see whether any really gigantic breaks in the series are apparent. This found no such breaks (but rest assured there are breaks; the databank is a raw data holding and not a data product per se).

3. First difference series correlations with proximal neighbors. We looked for three kinds of cases: correlation high and distance high; correlation low and distance low; and correlation perfect and distance low. These were then looked at manually. Many are longitude / latitude assignment errors. For example, we know Dunedin on the South Island of New Zealand is in the Eastern Hemisphere:
This is Dunedin. Beautiful place ...

And not the Western Hemisphere:
This is not the Dunedin you were looking for ... Dunedin is not the new Atlantis

But sadly two sources have the sign switched. The algorithm does not know where Dunedin is, so it is doing what it is supposed to. So, we need to tell it to ignore / correct the metadata for these sources so we don't end up with a phantom station.
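To see why a switched sign produces a phantom station rather than a merge, it helps to put a number on it. The sketch below uses the standard haversine great-circle formula; the Dunedin coordinates are approximate values supplied for illustration (they are not taken from any of the sources), and the 10 km matching radius is a made-up threshold, not the one the merge algorithm uses.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def could_be_same_station(lat1, lon1, lat2, lon2, max_km=10.0):
    """Hypothetical proximity rule: treat records within max_km as one site."""
    return haversine_km(lat1, lon1, lat2, lon2) <= max_km


# Approximate coordinates for Dunedin (illustrative values only).
true_lat, true_lon = -45.87, 170.50
flipped_lon = -true_lon  # the sign-switched longitude

gap_km = haversine_km(true_lat, true_lon, true_lat, flipped_lon)
```

With these values the two positions end up well over a thousand kilometres apart, so any distance-based matching will quite reasonably treat them as two different stations.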

There are other issues besides simple sign errors in lat / lon that these checks picked up. One of the data decks has many of its French stations' longitudes inflated by a factor of 10, so a station at 1.45 degrees East is wrongly placed at 14.5 degrees East. Pacific island stations appear to have been recorded under multiple names and ids, which confounds the merging in many cases.
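The neighbor comparison in strand 3 boils down to correlating first-differenced series and then asking whether the answer makes sense given how far apart the stations are. A minimal sketch under stated assumptions follows: every threshold here (50 km for 'near', 1000 km for 'far', 0.9 for 'high', 0.3 for 'low') is invented for the example, as are the function names, and the flagged pairs would still need looking at by hand.

```python
import math


def first_differences(series):
    """Value-to-value changes, which removes any constant offset."""
    return [b - a for a, b in zip(series, series[1:])]


def pearson(x, y):
    """Plain Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def classify_pair(corr, distance_km, near_km=50.0, far_km=1000.0,
                  high=0.9, low=0.3):
    """Label a station pair as suspicious, or return None if it looks fine.

    All thresholds are assumptions chosen for illustration.
    """
    if distance_km <= near_km and corr >= 0.9999:
        return "near, perfect correlation"   # possible duplicate record
    if distance_km >= far_km and corr >= high:
        return "far, high correlation"       # possible misplaced station
    if distance_km <= near_km and corr <= low:
        return "near, low correlation"       # possibly not really neighbors
    return None


# Two toy series that differ only by a constant offset.
a = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
b = [v + 10.0 for v in a]
corr_ab = pearson(first_differences(a), first_differences(b))
```

A pair like the toy one above, if reported 1500 km apart, would land in the 'far, high correlation' bin, which is the sort of case that turns out to be a sign or scaling error in the coordinates.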

4. As should be obvious from the above, we also needed to look at stations proverbially 'in the drink', so we have pulled a high resolution land-sea mask and run all stations against it. All cases that are demonstrably wet, meaning by more than 10 km (about 0.1 degree at the equator, and many sources are only accurate to 0.1 degree), are being investigated.
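The land-sea check in strand 4 can be sketched roughly as below. The mask layout is an assumption made for the example (a regular grid of booleans, True for land, row 0 starting at 90N and column 0 at 180W), as are the function names; a real mask will have its own format. The idea is that a station only counts as wet if every cell within the coordinate uncertainty is sea.

```python
import math


def cell_index(lat, lon, res):
    """Row and column of the grid cell containing (lat, lon), in degrees."""
    n_rows = round(180 / res)
    n_cols = round(360 / res)
    row = min(int((90.0 - lat) / res), n_rows - 1)
    col = int(((lon + 180.0) % 360.0) / res) % n_cols
    return row, col


def demonstrably_wet(lat, lon, mask, res, margin_deg=0.1):
    """True only if every mask cell within margin_deg of the point is sea.

    The margin is rounded up to whole cells, so a coarse mask makes the
    check more forgiving than the nominal margin suggests.
    """
    steps = math.ceil(margin_deg / res)
    row, col = cell_index(lat, lon, res)
    for dr in range(-steps, steps + 1):
        r = row + dr
        if r < 0 or r >= len(mask):
            continue  # off the top or bottom of the grid
        for dc in range(-steps, steps + 1):
            c = (col + dc) % len(mask[r])  # wrap around in longitude
            if mask[r][c]:
                return False  # some land within the margin
    return True


# Toy 1-degree mask: all sea apart from one land cell near 10.5N, 20.5E.
res = 1.0
toy_mask = [[False] * 360 for _ in range(180)]
land_row, land_col = cell_index(10.5, 20.5, res)
toy_mask[land_row][land_col] = True
```

Requiring the whole neighborhood to be sea is a deliberately cautious choice: coastal and small-island stations reported to only 0.1 degree should not be flagged just because their rounded position falls a little offshore.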

Investigations have generally used the trusty Google Maps and Wikipedia route, with other approaches where helpful. It's time consuming and thankless. The good news is 'we' (Jared) are (is) nearly there.

The whole blacklist will be one small text file that the algorithm reads and one very large PDF that justifies each line in that text file. As people find other issues (and there undoubtedly will be some - we will only catch the worst / most obvious offenders, even after several weeks on this) we can update and rerun.
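To make the 'small text file' idea concrete, here is one way such a file could be read and applied. The format is entirely hypothetical (the real layout is not settled in this post): one rule per line giving a source, a station id and an action, where 'drop' removes the station and 'relocate' supplies corrected coordinates. The source names and ids in the example are made up.

```python
def parse_blacklist(text):
    """Parse a hypothetical blacklist format into {(source, id): rule}.

    Lines look like 'SOURCE STATION drop' or
    'SOURCE STATION relocate LAT LON'; '#' starts a comment.
    """
    rules = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        source, station, action = fields[:3]
        if action == "drop":
            rules[(source, station)] = ("drop", None)
        elif action == "relocate":
            rules[(source, station)] = (
                "relocate", (float(fields[3]), float(fields[4])))
        else:
            raise ValueError(f"unknown action in blacklist: {action!r}")
    return rules


def apply_blacklist(stations, rules):
    """Return stations with dropped ones removed and relocated ones fixed."""
    kept = []
    for st in stations:
        action, coords = rules.get((st["source"], st["id"]), (None, None))
        if action == "drop":
            continue
        if action == "relocate":
            st = dict(st, lat=coords[0], lon=coords[1])  # copy, then correct
        kept.append(st)
    return kept


# Made-up example entries and station records.
example = """
# source  station  action    lat     lon
srcA      0001     drop
srcB      0002     relocate  -45.87  170.50   # longitude sign was switched
"""
stations = [
    {"source": "srcA", "id": "0001", "lat": 10.0, "lon": 20.0},
    {"source": "srcB", "id": "0002", "lat": -45.87, "lon": -170.50},
    {"source": "srcC", "id": "0003", "lat": 48.0, "lon": 1.45},
]
cleaned = apply_blacklist(stations, parse_blacklist(example))
```

Keeping the rules in a plain text file like this means a newly found problem is a one-line addition followed by a rerun, with the justification recorded separately.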
