Tuesday, October 9, 2012

Why are there several variants of the databank merge?

Historically, the storage, sharing and rescue of land meteorological data holdings have been highly fragmented. Different groups and entities at different times have undertaken collections in different ways, often using different identifiers, station names, station locations (and location precision) and averaging techniques. Hence the same station may well exist in several of the constituent Stage 2 data decks but with subtly (or not so subtly) different geo-location or data characteristics.

Neither an analyst nor an automated procedure will make the right call 100% of the time. However, an automated procedure does allow us to spin off some plausible alternatives. By spinning off such alternatives we can allow subsequent teams of analysts creating quality controlled and homogenized products to assess the sensitivity of their results to these choices. The alternative is to ignore such uncertainty, yet it clearly has some bearing on the final data products: errors at this stage (merging records that are in fact unique, or deeming unique what is in fact the same record) will affect the final results.

Several members of the Working Group suggested different merge priorities, sampling the choice of source decks and their ordering as well as the parameter choices within the code itself. From the methodology summary:

Variant One (colin)
In this variant, the source deck ordering is shifted to prioritize sources that originated from their respective National Meteorological Agencies (NMAs). This way, the most up-to-date locally compiled data are favored over consolidated repositories, which may or may not be up to date. In addition, sources that are either raw or quality controlled are favored over homogenized sources.

Variant Two (david)
Here, NMAs are favored, with sources that have TMAX, TMIN, and comprehensive metadata given the highest priority. The overlap threshold is lowered from 60 months to 24 months so that more data comparisons can be made.
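To make the overlap threshold concrete, here is a minimal sketch of how such a gate might work. It is not the databank code: the function names, the (year, month) keying and the sample series are all assumptions for illustration. Only the 60 and 24 month values come from the summary above.

```python
def overlap_months(candidate, target):
    """Count the months for which both series hold a value.

    Each series is a dict keyed by (year, month); this layout is an
    assumption for illustration, not the databank's own format.
    """
    return len(candidate.keys() & target.keys())


def can_compare(candidate, target, min_overlap=60):
    """Allow a data comparison only if the overlap meets the threshold."""
    return overlap_months(candidate, target) >= min_overlap


# Made-up series: candidate spans 1950-01..1952-12, target spans 1951-01..1960-12.
candidate = {(1950 + i // 12, i % 12 + 1): 10.0 for i in range(36)}
target = {(1951 + i // 12, i % 12 + 1): 10.5 for i in range(120)}

print(overlap_months(candidate, target))               # 24 shared months
print(can_compare(candidate, target, min_overlap=60))  # default threshold
print(can_compare(candidate, target, min_overlap=24))  # Variant Two threshold
```

With only two years in common, this pair would be skipped at the 60 month threshold but compared at 24 months.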

Variant Three (peter)
The source deck is changed under the following considerations. No TAVG source (or data from mixed sources) is ingested into the merge. This is because there is uncertainty in the calculation of TAVG (i.e., it is not always (TMAX+TMIN)/2). TAVG in the final product is only generated from its respective TMAX and TMIN values. For the remaining sources, GHCN-D is the highest priority, and the rest are ranked by the longest station record present within the source deck, from longest to shortest. The metadata equation is changed to weight the distance probability (10) over the height (1) and Jaccard (1) probabilities (the defaults are 9, 1, and 5, respectively). Finally, the thresholds for merging a station and for declaring it unique are lowered, which favors merging more stations.
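Two of these choices can be illustrated with a short sketch. The weights (10, 1, 1) and (9, 1, 5) are from the summary above; everything else is an assumption. In particular, combining the three probabilities as a normalized weighted mean is only a guess at the form of the metadata equation, and the function names and input probabilities are hypothetical.

```python
def tavg_from_extremes(tmax, tmin):
    """Derive TAVG consistently as the midpoint of TMAX and TMIN."""
    return (tmax + tmin) / 2.0


def metadata_score(p_distance, p_height, p_jaccard, weights):
    """Combine the three metadata probabilities into one score.

    ASSUMPTION: a normalized weighted mean. The real metadata equation
    may take a different form; only the weights are from the source.
    """
    w_d, w_h, w_j = weights
    total = w_d + w_h + w_j
    return (w_d * p_distance + w_h * p_height + w_j * p_jaccard) / total


DEFAULT_WEIGHTS = (9, 1, 5)   # distance, height, Jaccard (recommended merge)
PETER_WEIGHTS = (10, 1, 1)    # Variant Three

# Made-up example: stations that are close together but whose names match poorly.
probs = dict(p_distance=0.9, p_height=0.5, p_jaccard=0.2)

print(tavg_from_extremes(20.0, 10.0))
print(metadata_score(**probs, weights=DEFAULT_WEIGHTS))
print(metadata_score(**probs, weights=PETER_WEIGHTS))
```

Under this assumed form, down-weighting the Jaccard (name similarity) term raises the score for stations that sit close together but are named differently, which pushes towards merging.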

Variant Four (jay)
Within the algorithm, the data comparison test has three possible outcomes: the candidate station is merged, deemed unique, or withheld. In this variant, the withheld outcome is removed, so the candidate station is either merged or deemed unique.
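A minimal sketch of the difference between a three-way and a two-way decision follows. The threshold values, the single "probability of being the same station" input and the function name are hypothetical. How the real code reassigns the formerly withheld cases is not stated here, so sending them to "unique" is purely an assumption.

```python
def classify(p_same, merge_threshold=0.5, unique_threshold=0.3,
             allow_withheld=True):
    """Decide the fate of a candidate station.

    p_same is a hypothetical probability that the candidate and target
    are the same station. Threshold values are illustrative only.
    """
    if p_same >= merge_threshold:
        return "merged"
    if p_same <= unique_threshold or not allow_withheld:
        # ASSUMPTION: with withholding disabled, ambiguous cases become unique.
        return "unique"
    return "withheld"


for p in (0.8, 0.4, 0.1):
    print(p, classify(p), classify(p, allow_withheld=False))
```

Only the ambiguous middle case (0.4 here) changes outcome between the two modes. Moving the two thresholds up or down in the same sketch shifts stations between the merged and unique bins.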

Variant Five (matt)
All homogenized sources are removed. Nothing else is altered compared to the recommended merge.

Variant Six (more-unique)
Thresholds are adjusted to make more candidate stations unique, thus increasing the overall station count.

Variant Seven (more-merged)
Thresholds are adjusted to make more candidate stations merge with target stations, thus decreasing the overall station count.

These have a substantive impact on several aspects of behavior. Most notably:

Station count
Gridbox coverage
Timeseries behavior
The outlier in each is the 'Peter' variant (yep, that is me) and results from the fact that early records have very few max / min measurements as currently archived. The lower anomaly early in this variant is a result of sampling and not a reflection of fundamental discrepancies. If we sub-sample the other variants to the same very restricted geographic sample they fall back into agreement.
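The sub-sampling check can be sketched as follows. This is a toy illustration rather than the analysis behind the plots: the gridbox anomalies are made-up numbers, the function name is hypothetical, and cosine-of-latitude weighting is assumed as the area weight.

```python
import math


def area_weighted_mean(gridbox_anomalies, boxes=None):
    """Average anomalies over gridboxes, weighted by cos(latitude).

    gridbox_anomalies maps (lat_centre, lon_centre) -> anomaly.
    If boxes is given, only those gridboxes are used (the sub-sample).
    """
    if boxes is None:
        boxes = gridbox_anomalies.keys()
    num = den = 0.0
    for box in boxes:
        if box in gridbox_anomalies:
            weight = math.cos(math.radians(box[0]))
            num += weight * gridbox_anomalies[box]
            den += weight
    return num / den if den else float("nan")


# Made-up early-period anomalies: a well-sampled variant and a sparse one.
full = {(52.5, 2.5): -0.2, (47.5, 7.5): -0.3,
        (2.5, 32.5): 0.4, (-27.5, 152.5): 0.5}
sparse = {(52.5, 2.5): -0.25, (47.5, 7.5): -0.28}

common = full.keys() & sparse.keys()

print(area_weighted_mean(full))           # all gridboxes of the full variant
print(area_weighted_mean(full, common))   # full variant on the sparse footprint
print(area_weighted_mean(sparse))         # sparse variant
```

In this toy case the full variant differs markedly from the sparse one when averaged over all of its gridboxes, but comes close to it once restricted to the gridboxes the sparse variant actually covers.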

This reflects the very real importance of doing data rescue. We know that as much pre-1960 data exists only as images or hard copy as has been digitized. This will be returned to at a later date.

Of course, if you don't like these variants or just want a play you can create your very own variant simply by downloading and using the code that has been made available alongside the beta release.


  1. Peter, by "outlier" do you mean the black line, marked "GHCN-M V3"? It has far fewer stations than the others, but still has better "Gridbox coverage" than the lowest red line. Can you explain this low outlier in the Gridbox coverage plot? Which merge method reduced the number of grid boxes so much, and do we understand why?

    1. Victor, by outlier I mean the 'Peter variant' which is the odd-man-out of the ensemble. GHCNv3 is even more different than any of the variants, except in gridbox count and early period 'global averages'. The Peter variant is even more odd in these two and is the outlier I was referring to.

      In terms of 'understanding', this is because in some areas we only have average temperature reported. The rationale in the 'Peter' variant was that we don't really know how the average is averaged. So, based largely upon my former incarnation in radiosondes, I wanted a realization that was very clean: one that consisted solely of max, min and an average constructed using a consistent method from these underlying elements. This comes at a hit in station count and in particular spatial completeness, but may have better fundamental provenance, at least with respect to the derived average timeseries. How important this is is very much up for debate, but putting such a variant there does allow this to be explored at least ...