Tuesday, December 18, 2012

So what changed between beta1 and beta2?

Even though we documented changes from the beta1 release to beta2, did we actually change the overall results of the databank? Well, there was a noticeable drop in station count over time, especially over the past 50 years:
However, this does not mean that we are losing important stations. It came to light through Nick Stokes’ blog that there may have been duplicates within the beta1 release. After some analysis, we tweaked the algorithm to remove these duplicates. As a result the station count has dropped, but we still retain many more stations than the current operational product, GHCN-M version 3. In the end, addressing these changes from beta1 to beta2 did not make a major difference to the annual global anomalies.

We are still in beta, however we are pushing forward for a version 1.0.0 release soon!

Monday, December 10, 2012

Where do differences between the databank and GHCNv3 arise?

If you were paying attention to an earlier post characterizing the first beta release, you will have noted that the databank timeseries behavior is subtly different from that of the 'raw' GHCNv3.

The early period record is slightly cooler than the estimates from GHCNv3, while the last decade is warmer than GHCNv3. The net impact is to increase the apparent trend. This pattern is present in all the merge variants to a greater or lesser degree, which raises the logical question of why this difference arises. Is it because the databank's larger number of stations samples areas of the globe previously unsampled in GHCNv3, which behaved differently from the restricted GHCNv3 sample? Or is it down to additional station sampling in areas already sampled by GHCNv3? And if so, why? The two graphs below do the obvious thing and split it out simply by averaging over gridboxes present in both and those only in the databank (there is a much smaller population of gridboxes present in v3 but not in the databank, which is far too small to have a material impact on the global estimates being considered here).

With GHCNv3 gridbox sampling (note the difference between red and blue)

New gridboxes.

So, most of the difference appears to reflect better sampling of regions already sampled. The question of why, and what impact it has on homogenization efforts, is 'future work' ... and is why we now need multiple groups to take up the challenge of creating new data products from the databank.

Thursday, December 6, 2012

Databank Release: Beta #2

Today, we have released our second beta version of the global land surface databank. This update includes some changes that were made in response to comments on this very blog, along with a few minor tweaks.

The beta2 release can be found here: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/. Within that directory one can find all the data and code used, along with some graphics depicting the results of all the merge variants. A technical description of the merge program (similar to beta1) is also provided, along with a new file documenting changes from beta1 to beta2.

Beta1 is not forgotten and lost forever. All the data and code from beta1 is located in our archive if anyone still wishes to access it: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/archive/monthly/stage3/beta1/

Some of the major changes include the following:
  • Added a metadata comparison check of when the data record began
  • Added source data from Sweden, Uruguay, Norway, Canada, and the MetOffice's new HadISD dataset
  • Updated lookup table to determine whether a candidate station is merged, unique, or withheld after a data comparison is made
The original merging methodology can be found here, as well as a description of the changes from beta1 to beta2 here.

The deadline has passed for new data to be added for an official version 1 release. However, there is still plenty of time to provide feedback on all the methodologies used in constructing the databank. Your comments have helped us so far, and we welcome any more that may arise.

Tuesday, November 6, 2012

Taking the temperature of the Earth: Temperature Variability and Change across all Domains of Earth's Surface

There is a session of the same title as this blog post being organized by colleagues in the Earthtemp initiative (www.earthtemp.net) at next year's EGU meeting. The session details from the EGU meeting website are:

The overarching motivation for this session is the need for better understanding of in-situ measurements and satellite observations to quantify surface temperature (ST). The term "surface temperature" encompasses several distinct temperatures that differently characterize even a single place and time on Earth’s surface, as well as encompassing different domains of Earth’s surface (surface air, sea, land, lakes and ice). Different surface temperatures play inter-connected yet distinct roles in the Earth’s surface system, and are observed with different complementary techniques.

There is a clear need and appetite to improve the interaction of scientists across the in-situ/satellite 'divide' and across all domains of Earth's surface. This will accelerate progress in improving the quality of individual observations and the mutual exploitation of different observing systems over a range of applications.

This session invites oral and poster contributions that emphasize sharing knowledge and make connections across different domains and sub-disciplines. They can include, but are not limited to, topics like:

* How to improve remote sensing of ST in different environments

* Challenges from changes of in-situ observing networks over time

* Current understanding of how different types of ST inter-­relate

* Nature of errors and uncertainties in ST observations

* Mutual/integrated quality control between satellite and in-situ observing systems.

If you are interested in attending, abstracts need to be submitted by January 9th, 2013.

More info can be found at http://meetingorganizer.copernicus.org/EGU2013/session/12115

We will run a guest post by the Earthtemp organizers in the coming weeks outlining what their effort involves and how it is synergistic with the International Surface Temperature Initiative. Watch this space ...

Friday, October 26, 2012

Databank poster at Global Framework for Climate Services User Conference

There is a meeting happening this week in Geneva in preparation for an Extraordinary Meeting of the World Meteorological Organization Congress. The meeting details can be found here. A local low-resolution version of the poster (still 3Mb, for those on slow connections) can be found here. The poster outlines progress to date and highlights that much remains to be done. Given that many of the great and the good of the meteorological world are in attendance, including delegations from most National Meteorological Services, it is hoped that this poster can raise awareness of the databank, help gain access to additional data, and promote additional data rescue efforts. That said, nothing is ever likely to change overnight ...

Thursday, October 18, 2012

How do you decide if a station is to be merged, added as unique or withheld?

The above flowchart is a simple visualization of how the merge program works. As you can see, there are a number of different paths a candidate station can take. I'm confident enough to say that each and every situation above happens at least once in the recommended merge!

Let's break down this flowchart, starting with the metadata check. The candidate station is compared against all the target stations, calculating three metrics:
  • distance probability
  • height probability
  • station name similarity using Jaccard Index
These probabilities range from 0 to 1, where 0 means no station match and 1 means a perfect station match. Using a quasi-Bayesian approach, these three metrics are combined to form a posterior probability of station match (again, between 0 and 1), known as the metadata probability. The metadata probability is calculated between the candidate station and every target station.
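As an illustration of the name-similarity metric, the Jaccard index can be computed over character bigrams of the two station names. This is a Python sketch only; the merge code is FORTRAN 95, and its exact tokenization of station names may differ from the bigram choice assumed here:

```python
def jaccard(name_a: str, name_b: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two station names,
    treating each name as a set of character bigrams (an assumption)."""
    def bigrams(s: str) -> set:
        s = s.upper().strip()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    a, b = bigrams(name_a), bigrams(name_b)
    if not a and not b:
        return 1.0  # two empty names: treat as identical
    return len(a & b) / len(a | b)
```

Identical names score 1.0, names sharing no bigrams score 0.0, and partial matches (e.g. the same name with words reordered or abbreviated) land in between, which is exactly the behavior wanted from a name-similarity metric.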

Using a threshold of 0.50, we then determine the next step. If no metadata probability values exceed this threshold, we check the validity of the individual metadata metrics. If it turns out that two metrics are really good (> 0.90) and the third one is terrible, we determine that there is bad or incomplete metadata, and the station is withheld. Otherwise we are confident that the station is unique in its own right and we add it to the target dataset.
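A minimal sketch of this branch of the flowchart, in Python. Two pieces are placeholders, not the actual implementation: the product used to combine the three metrics stands in for the real quasi-Bayesian formula, and the 0.10 cutoff for a "terrible" metric is a hypothetical value:

```python
def metadata_decision(dist_p: float, height_p: float, name_p: float,
                      threshold: float = 0.50, good: float = 0.90) -> str:
    """Sketch of the metadata check. The product below is a stand-in
    for the real quasi-Bayesian combination; 0.10 as the 'terrible
    metric' cutoff is likewise an assumption for illustration."""
    posterior = dist_p * height_p * name_p  # placeholder combination
    if posterior > threshold:
        return "compare_data"  # proceed to the data-comparison step
    metrics = [dist_p, height_p, name_p]
    n_good = sum(m > good for m in metrics)
    if n_good == 2 and min(metrics) < 0.10:
        return "withheld"      # two strong metrics, one terrible: bad metadata
    return "unique"            # no plausible match: add as a new station
```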

If any stations exceed the threshold of 0.50, then we move down the left side of the chart to the data comparisons. Using an overlap of no less than 5 years, we calculate the Index of Agreement, a "goodness-of-fit" measure similar to the coefficient of determination but less sensitive to outliers. As with the metadata probability, this is calculated between the candidate station and all target stations with metadata probability values greater than 0.50.
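Willmott's Index of Agreement is the standard formulation of such a measure; here is a sketch assuming that is the form used (the merging methodology document gives the exact definition used in the code):

```python
def index_of_agreement(x: list, y: list) -> float:
    """Willmott's Index of Agreement between two overlapping series.
    1.0 means perfect agreement; values near 0 mean poor agreement."""
    mean_y = sum(y) / len(y)
    num = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    den = sum((abs(xi - mean_y) + abs(yi - mean_y)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - num / den if den != 0 else 1.0
```

Because the denominator sums absolute deviations before squaring, a single large outlier inflates it along with the numerator, which is why this index is less outlier-sensitive than the coefficient of determination.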

We then check whether any comparisons were made. If not, then the two stations either had no overlap period or had an overlap shorter than the 5-year threshold. At this point one of two things can happen. We look at the target station with the highest metadata probability: if that probability is greater than 0.85, the station is merged; if not, it is withheld.

If a data comparison was made via the Index of Agreement, then a lookup table takes into account both the IA and the overlap period and produces a probability of station match, as well as one of station uniqueness. These are then recombined with the metadata probability to form posterior probabilities of station "sameness" and station "uniqueness". If any "sameness" probability passes the merge threshold, the candidate station merges with that target station. If no "sameness" probability passes the merge threshold but a "uniqueness" probability passes the unique threshold, the candidate station is added as unique. Otherwise, the station is withheld.
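Stripped of the lookup-table details, the final three-way decision can be sketched as follows. The threshold values of 0.5 here are hypothetical defaults for illustration; the real ones are set in the "User Defined Thresholds" section of the merge code:

```python
def merge_decision(same_p: float, unique_p: float,
                   same_thresh: float = 0.5, unique_thresh: float = 0.5) -> str:
    """Sketch of the final decision given posterior 'sameness' and
    'uniqueness' probabilities (threshold values are assumptions)."""
    if same_p > same_thresh:
        return "merged"      # candidate merges with the target station
    if unique_p > unique_thresh:
        return "unique"      # candidate added as a new station
    return "withheld"        # evidence is ambiguous either way
```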

A more detailed description of the above flowchart can be found here.

Tuesday, October 16, 2012

How do I work out where a station series in the merged product originates?

The above image is an example station from Logan International Airport in Boston, Massachusetts, USA. There were three different sources that went into this merged station. For every temperature value in the merged product, there is a corresponding number. That number represents the spot in the source hierarchy used to merge the stations. Using that number, one can find the station source.

Using the above image, it can be seen that the sources are GHCN-Daily (source #01, black), russsource (source #35, red), and ghcnmv2 (source #39, blue). Now that the sources are known, one can find the Stage 2 data for this station. A user can also look further back and find the original digitized copy (Stage 1), and sometimes even the original paper copy (Stage 0).
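For example, one could tally how many values each source contributed to a merged series by counting the per-value source numbers. The flag layout assumed here (a simple sequence of source numbers, one per monthly value) is illustrative; consult the Stage 3 format description for the actual file layout:

```python
def source_counts(flags: list) -> dict:
    """Tally which source (by hierarchy number) supplied each monthly
    value in a merged series. 'flags' is the sequence of per-value
    source numbers (layout assumed for illustration)."""
    counts = {}
    for f in flags:
        counts[f] = counts.get(f, 0) + 1
    return counts
```

For a Logan-like station, the tally would show how much of the record traces back to each of GHCN-Daily, russsource, and ghcnmv2.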

Wednesday, October 10, 2012

Where is the databank merge code? How can I make it work?

The merge code was written in FORTRAN 95 and is located within each variant's directory. For example, the code for the recommended merge is located here:


Once uncompressed, there are four files that are required for the program to run correctly. The program will fail if any one of these files is missing. A description of each file is below:

databank_sources.txt: This is the prioritized list of sources that go into the merge program. This file tells the user the name of the source, number of stations, whether it was originally a monthly or daily source, and whether it includes TMAX, TMIN, or TAVG temperature. In order to acquire the source data, one is required to grab the data from the databank monthly stage 2 FTP site.

lookup_IA.txt: This is the lookup table the program reads in to determine the probability of station match and station uniqueness.

merge_module.f95: This is a module the main program calls when performing certain functions. This was done so that simple procedures called multiple times are written only once. In addition, this provides the user the opportunity to write in their own code and compare results.

merge_main.f95: This is the main program. The first section, named "User Defined Thresholds," is where the user can define directory structures and performance thresholds.

A more detailed description of these files, along with justification, can be found in the merging methodology document. A compiler is required to run the program. There are many different FORTRAN compilers; however, the code was written to comply with the g95 compiler, which is free and available to the public here.

Once a compiler (such as g95) is installed, simply type in the following command:

g95 merge_module.f95 merge_main.f95

And you should be good to go! Although not required, the user is strongly encouraged to tweak any of the thresholds and/or priority list in order to achieve different results. 

If there are any questions, feel free to send an e-mail to data.submission@surfacetemperatures.org, or simply comment on this post.

Tuesday, October 9, 2012

Why are there several variants of the databank merge?

Historically the storage, sharing and rescue of land meteorological data holdings has been incredibly fractured in nature. Different groups and entities at different times have undertaken collections in different ways, often using different identifiers, station names, station locations (and precision) and averaging techniques. Hence the same station may well exist in several of the constituent Stage 2 data decks but with subtly (or not so subtly) different geo-location or data characteristics.

Neither an analyst nor an automated procedure will make the right call 100% of the time. However, an automated procedure does allow us to spin off some plausible alternatives. By spinning off such alternatives we allow subsequent teams of analysts creating quality controlled and homogenized products to assess the sensitivity of their results to these choices. The alternative is to ignore such uncertainty, yet it clearly has some bearing on the final data products: errors at this stage (merging records that are unique, or deeming the same record unique) will affect the final results.

Several members of the Working Group suggested different merge priorities sampling the choice of source decks and their ordering as well as the parameter choices within the code itself. From the methodology summary:

Variant One (colin)
In this variant, the source deck is shifted to prioritize sources that originated from their respective National Meteorological Agencies (NMAs). This way, the most up-to-date locally compiled data is favored over consolidated repositories, which may or may not be up to date. In addition, sources that are either raw or quality controlled are favored over homogenized sources.

Variant Two (david)
Here, NMAs are favored, with TMAX, TMIN, and comprehensive metadata given the highest priority. The overlap threshold is lowered from 60 months to 24 months, in order for more data comparisons to be made.

Variant Three (peter)
The source deck is changed under the following considerations. No TAVG source (or data from mixed sources) is ingested into the merge, because there is uncertainty in the calculation of TAVG (i.e., it is not always (TMAX+TMIN)/2). TAVG in the final product is only generated from its respective TMAX and TMIN values. For the remaining sources, GHCN-D is the highest priority, and the rest are ranked by the length of the longest station record present within the source deck, from longest to shortest. The metadata equation is changed to give greater weighting to the distance probability (10) over the height (1) and Jaccard (1) probabilities (the defaults are 9, 1, and 5, respectively). Finally, the merge and unique thresholds are lowered so as to merge more stations.

Variant Four (jay)
Within the algorithm, the data comparison test has three possible outcomes: the station is merged, added as unique, or withheld. In this variant, the withheld outcome is removed, so the candidate station is always either merged or unique.

Variant Five (matt)
All homogenized sources are removed. Nothing else is altered compared to the recommended merge.

Variant Six (more-unique)
Thresholds are adjusted to make more candidate stations unique, thus increasing the overall station count.

Variant Seven (more-merged)
Thresholds are adjusted to make more candidate stations merge with target stations, thus decreasing the overall station count.
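To illustrate the re-weighting in the 'peter' variant, here is a sketch of a weighted combination of the three metadata metrics. The weighted average below is an assumed form; the actual metadata equation may combine the metrics differently, but the weights are those quoted above:

```python
def weighted_metadata_prob(dist_p: float, height_p: float, name_p: float,
                           w: tuple = (9, 1, 5)) -> float:
    """Weighted combination of the three metadata metrics (a sketch;
    the real equation may differ in form). Default weights are the
    (9, 1, 5) quoted above; the 'peter' variant uses (10, 1, 1)."""
    wd, wh, wn = w
    return (wd * dist_p + wh * height_p + wn * name_p) / (wd + wh + wn)
```

The effect: for a pair with a strong distance match but a dissimilar name, the (10, 1, 1) weighting yields a higher combined probability than the default (9, 1, 5), so more such pairs clear the merge threshold.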

These have a substantive impact on several aspects of behavior. Most notably:
  • Station count
  • Gridbox coverage
  • Timeseries behavior
The outlier in each is the 'peter' variant (yep, that is me), and results from the fact that early records have very few max/min measurements as currently archived. The lower anomaly early in this variant is a result of sampling and not a reflection of fundamental discrepancies. If we sub-sample the other variants to the same very restricted geographic sample, they fall back into agreement.

This reflects the very real importance of doing data rescue. We know there are as many data pre-1960 in image / hardcopy only as have been digitized. This will be returned to at a later date.

Of course, if you don't like these variants or just want a play you can create your very own variant simply by downloading and using the code that has been made available alongside the beta release.

Friday, October 5, 2012

Is it too late to submit data for inclusion in the databank?

Short answer: No

Slightly longer answer: We are still accepting data submissions for inclusion in the first version release until November 30th. At that point we shall provide an updated beta release version with any new data sources that have been received. Even then, there is no end-point for submissions that can be included in subsequent version releases. There is also no point at which we are likely to have ‘too much’ data so any data is useful.

More detail:

Data submissions can range from a single station to large consolidated holdings. The merge program attempts to discriminate between different sources, so, as long as sufficiently accurate geo-location metadata are provided (latitude, longitude, station name and elevation), it should be able to cope with a degree of information redundancy. It is therefore not necessary to ascertain first whether a version of each candidate station record already exists. If the submission has greater provenance (a link to the original in hardcopy / image form, better station metadata including a history of observing practices and instrument changes, etc.) it will likely be given priority. So, do not worry about whether the data already exist in the Stage 2 holdings unless it is simply a duplicate resubmission of a pre-existing holding (obviously).

If you need help in negotiating release of data to the databank, there exist a boilerplate letter of support and a certificate of appreciation (the latter on request). Further case-specific help can be provided by Databank Working Group members upon request.

Once you have the data, we have tried to make its submission as easy as possible. There are submission guidelines which provide the details of what data are required and how to submit. We do not require that data be converted from their native digital format; in fact we prefer that you not convert them, as this may introduce undetectable errors. You do, however, need to describe the format sufficiently that a conversion script can be written to convert it to Stage 2.

Although the first Stage 3 merged release consists solely of monthly resolution temperature data, we strongly encourage submission of data at one or more of sub-daily, daily and monthly resolution, and for multiple meteorological elements, not just temperature. It is hoped that future releases will include these shorter timescales and additional elements, which will be useful for many scientists and end-users beyond the more restricted aims of the International Surface Temperature Initiative.

Good luck and thanks

Thursday, October 4, 2012

What known issues remain to be addressed with the databank during beta release?

First and foremost, this beta release affords the opportunity for broader community input to the databank process. Having many eyes on the prize, hands turning over the rocks and boots kicking the tires will make this thing better. So, there will undoubtedly be a number of issues in addition to those highlighted here that will arise. We welcome such feedback. There are a number of issues which we know we will address during beta:
  1. Where we have daily sources we will append provenance flags which specify how many daily reports went into each monthly value. This may prove useful to analysts down the line. Where we only have monthlies we will append this flag as a missing value indicator.
  2. We will create a version which re-injects the element-wise provenance flags from the constituent Stage 2 (source deck) holdings into the Stage 3 (merged) holdings. For computational efficiency the flags cannot be carried through the merge program itself, but, obviously, the information is available to do this as post-processing.
  3. We are well aware that some stations will have poor geo-location metadata (i.e. Spanish stations in the Sahara or older station segments using a different meridian to Greenwich). At present no blacklisting is applied. But we will definitely apply such blacklisting to the first version merge. We are still in the process of collating a list of known geo-location errors to apply such a fix. One of two things will be done: a correction to the geo-location data where this is known and generally accepted; or force the code to withhold the station from the merge. If you find an apparent issue with a station’s location please let us know either through the blog or data.submission@surfacetemperatures.org so that we can investigate and determine what to do.
  4. We plan to release all the stage 1 to stage 2 conversion scripts for completeness. This will be done soon. There are only so many hours in the day and getting the merge in order has been the priority.
  5. We plan to create several output formats to aid usability. These are hoped to include cf compliant netcdf. For now the databank is made available in two ASCII-based formats.
  6. Instigation of a consistent station identifier system that is robust to future data additions and deletions and is consistent with daily identifiers moving forwards.

Beta release of first version of global land surface databank

Today marks the release of the first beta version of the global land surface databank constructed under the auspices of the International Surface Temperature Initiative’s Databank Working Group. The release is of monthly average temperatures from stations around the globe that have been made available without restriction.

The release will be in beta for a period of 3 months before an official first version release. It is hoped that during this time users can take a look and provide feedback (preferably through the Initiative blog) and advice to ensure that the first version release is of the highest possible quality. Additional data submissions received prior to November 30th will be incorporated in the first version release.

The release consists of:
·      Over 40 distinct source decks (compilations / holdings) submitted to the databank to date in Stage 0 (hardcopy / image; where available), Stage 1 (native digital format), and Stage 2 (converted to common format and with provenance flags).
·      A recommended merged product and several variants thereon which have all been built off the stage 2 holdings
·      All code used to process the data merge
·      Documentation necessary to understand at a high level the processing of the data

The release is available from ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/. The merged product can be found at ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/. The recommended merge consists of over 39 thousand stations, which range in length from a few years to over two centuries.

This is data that mostly has not been quality controlled or bias corrected. It is important to stress that it therefore does not constitute a climate data record / dataset suitable for monitoring long-term changes. Rather, it provides a basis from which research groups can create algorithms to produce climate datasets. The results from these algorithms can then be compared and benchmarked as part of the International Surface Temperature Initiative activities. We hope that many groups and individuals take up this challenge which will lead to improved understanding of land surface air temperature changes particularly at regional scales.

This release is the culmination of two years effort by an international group of scientists to produce a truly comprehensive, open and transparent set of fundamental monthly data holdings. In the coming weeks a number of additional postings to the blog will attempt to explain different aspects of this databank.

More information on the Initiative and how to get involved can be found at www.surfacetemperatures.org .

Wednesday, September 26, 2012

On the importance of metadata ...

If you are reading this then you undoubtedly know that much contention arises over the siting of stations, particularly over the United States, where such effects have been documented for the USHCN subset of the COOP network through the citizen-science www.surfacestations.org effort. This, and available modern station inventories, shows that most of the network now consists of MMTS sensors (the things that look like the heads of Daleks in Dr. Who, or stacked UFOs) rather than the Cotton Region Shelters (Stevenson Screens), the white-painted wooden ventilated boxes that housed liquid-in-glass thermometers for the majority of the record. As an aside, very early in the US record, prior to the early twentieth century, as elsewhere, a large number of approaches were used.

Given that:
  • We know there must have been a change between Cotton Region Shelters and MMTS for all USHCN stations that are currently MMTS instrumented.
  • The MMTS has a short electrical lead that will more often than not have involved a relocation of the instrument closer to a power source (building) with lower quality siting characteristics.
  • The MMTS has different measurement characteristics (a tendency to under-estimate daily maxima and over-estimate daily minima compared to Cotton Region Shelter instrumentation), verified by several side-by-side comparisons, some over several decades.
It is of interest to ask for what period of time the modern station configurations may have been 'representative', i.e. when such sites likely changed location and/or instrument. Both the very likely change in physical measurement location and the certain change in instrument characteristics are important considerations for the continuity of the station records, and they clearly cannot be divorced from one another on a site-by-site basis.

Fortunately, for the US we have good, although not complete, metadata. Based upon this it is possible to break down when important events such as instrument changes, changes in time of observation and other considerations occurred, both for the USHCN subset and the larger COOP network (although the metadata for the remainder of the COOP is somewhat poorer prior to the mid-twentieth century). This is shown below (courtesy Claude Williams, NCDC):

Timeseries of the frequency of metadata event types for a subset of metadata classes across both USHCN and the broader COOP network. For completeness, ASOS is described here.

So, from the above figure it is obvious that the MMTS transition started in 1982, with the bulk of the transition in both the USHCN and COOP occurring between then and 1990, but a substantial number of replacements occurring through at least 2000 (some of these may have been replacements of previously installed MMTS sensors). So, how far back can one infer anything from the modern siting of currently MMTS stations? At the earliest 1982, when the replacement program began, and possibly no further back than the early 2000s for some stations. Beyond that, careful forensics on a site-by-site basis would be required to ascertain whether the MMTS transition necessitated a change in measurement location. If it did, then modern siting would have no bearing on the pre-MMTS segment of the record. Given that the metadata generally records only the most significant change, it may be rare that the metadata itself notes whether a change in measurement location accompanied the (more substantial) change in instrumentation.

Bottom line: we need to know not just what the site looks like now but how it has changed over its history if we are to properly assess potential issues of representativity and homogeneity. Current siting tells us about today's measurements, not about yesterday's (or, for that matter, tomorrow's; MMTS has itself now started to be replaced by a newer, although similar, instrument) ... so we need contiguous metadata and not simply snapshots (although they are a valuable start, don't get me wrong) if we are to properly interpret records. Of course, outside the US it is rare to have access to more than latitude, longitude, elevation and name. That doesn't mean more doesn't exist; rather, it has not been shared, and certainly not in a common, machine-readable format. But that is another post for another time ...

Wednesday, August 1, 2012

Do you want to help with data discovery?

As was alluded to in an earlier posting here, NOAA's National Climatic Data Center has recently embarked on an effort to discover and rescue a plethora of international holdings held in hard copy in its basement and make them usable by the international science community. The resulting images from the first chunk of these efforts have just been made available online. Sadly, it is not realistic at the present time to key these data, so they remain stuck in a half-way house: available, tantalizingly so, but not yet truly usable.

So, if you want to undertake some climate sleuthing, now is your moment to shine ...! The data have all been placed at ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/daily/stage0/FDL/ . These consist of images at both daily and monthly resolution (don't be fooled by the 'daily' in the ftp site address). If you find a monthly resolution data source, you could digitize years' worth of records in an evening.

Whether you wish to start with Angola ...

or Zanzibar ...

There is data for you to discover. As the Readme file says ...

The following are documents recently imaged through an effort at the 
National Climatic Data Center (NCDC). NCDC holds over 2000 cubic 
feet of foreign records in paper format of in-situ weather observations 
in the on-site physical archives. The data files included are digital 
photographs of records in this collection. NCDC federal and contract 
employees captured the images using NCDC-owned digital cameras. All 
photos are in JPEG format.

The images contain in-situ observations from all areas of the globe 
other than the United States. African observations were the initial 
geographic area of concentration in the imaging effort but have grown 
to include other continents and regions. The collection consists of 
weather observations that were taken almost exclusively between 1885 
and 1975. The text within the images are in various languages.

A worldwide community of scientists will benefit from this effort, 
which is part of a global effort to discover, scan and key missing 
in-situ data. The data from the images will eventually be added to 
integrated global datasets, including baseline datasets at NCDC.

This directory contains the following:
- 45 tar files containing images of the data
- An inventory file describing the files that were recently examined 
  at NCDC (the ones highlighted in yellow have been imaged)
- An example .csv file describing the preferable format of the data 
  if these images were to be digitized

So, if you want to do some discovery and recovery of data the opportunity is now there to do so. Any data submitted to the databank using the submission guidelines detailed here will be shared without restriction for use by anyone for any purpose. We would strongly encourage keying of all recorded meteorological parameters although clearly temperatures are essential.
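For anyone tempted to try keying, the workflow might look something like the sketch below. The column layout here is purely hypothetical - the authoritative template is the example .csv file in the ftp directory - but it illustrates turning values transcribed from an image into a machine-readable file:

```python
import csv
import io

# Hypothetical column layout for illustration only; the authoritative
# template is the example .csv file provided in the ftp directory.
FIELDS = ["station_name", "latitude", "longitude", "elevation_m",
          "year", "month", "mean_tmax_c", "mean_tmin_c", "source_image"]

def write_keyed_records(records, out):
    """Write transcribed monthly records (dicts keyed by FIELDS) to an
    open text stream as CSV, header row first."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)

# Example: one month transcribed from a (hypothetical) imaged register.
record = {"station_name": "Zanzibar", "latitude": -6.17, "longitude": 39.2,
          "elevation_m": 15, "year": 1923, "month": 7,
          "mean_tmax_c": 28.4, "mean_tmin_c": 21.9,
          "source_image": "zanzibar_1923.jpg"}
buf = io.StringIO()
write_keyed_records([record], buf)
```

Keeping the source image filename alongside each keyed value is what preserves the chain of provenance back to the paper record.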

Regardless of one's viewpoint, there cannot be a downside to improving data availability if one wants to be able to make informed analyses and decisions, particularly when that data has an unbroken chain of provenance back to the raw paper record. So, this really is an opportunity to provide something uniquely useful to scientists and the public around the world and to 'own' a chunk of the global climate record.

Any help gratefully received.

Wednesday, June 27, 2012

A new homogenised daily temperature data set for Australia

The ACORN-SAT station at Butlers Gorge in central Tasmania
A new homogenised daily temperature set, the Australian Climate Observations Reference Network – Surface Air Temperature (ACORN-SAT) data set, has recently been released by the Australian Bureau of Meteorology. This data set contains daily data for 112 stations, with 60 of them extending for the full period from 1910 to 2011, and the others commencing progressively up until the 1970s. (1910 is taken as the starting point, as it was only with the formation of the Bureau of Meteorology in 1908 as a federal organisation that instrument shelters became standardised.)

The new data set applies differential adjustments to different parts of the daily temperature frequency distribution, using an algorithm which matches percentile points in the frequency distribution with those at reference stations before and after an inhomogeneity. This takes into account the fact that some inhomogeneities in temperature records have different impacts in different weather conditions – for example, if a site moves from a coastal location to one further inland, the difference in overnight minimum temperature will normally be greater on clear, calm nights than on cloudy, windy nights, and clear, calm nights are also likely to be the coldest nights (at least in the mid-latitudes). If such effects are not accounted for, adjustments which homogenise mean temperatures may not homogenise the extremes. 
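The percentile-matching idea can be sketched roughly as follows. This is a minimal illustration, not the actual ACORN-SAT code: the function name, the use of 99 percentile points, and the framing in terms of a single candidate series before and after a break are all assumptions made for the example.

```python
import numpy as np

def quantile_match_adjust(pre, post, n_percentiles=99):
    """Adjust the pre-break segment of a daily temperature series so
    that its percentile points line up with those of the post-break
    segment.  `pre` and `post` are 1-D arrays of daily values before
    and after a detected inhomogeneity; returns the adjusted pre-break
    values."""
    q = np.linspace(1, 99, n_percentiles)
    pre_q = np.percentile(pre, q)    # percentile points before the break
    post_q = np.percentile(post, q)  # percentile points after the break
    # Find each pre-break value's percentile rank, then apply the
    # difference between the post- and pre-break quantile functions at
    # that rank, so cold nights and warm nights can receive different
    # adjustments.
    ranks = np.interp(pre, pre_q, q)
    return pre + np.interp(ranks, q, post_q - pre_q)
```

In the coastal-to-inland example from the text, the adjustment applied at the 5th percentile (the coldest nights) would differ from that applied at the median, which is exactly what a single mean offset cannot capture.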

The method is conceptually similar to that of Della-Marta and Wanner (2006), while differential adjustment methods of this type have also been developed by a number of other authors. The ACORN-SAT data set, however, is believed to be the first implementation of such methods in a national, year-round data set. 

A detailed evaluation of the method was carried out before finalising its implementation. One of the interesting findings was that, while some earlier studies had suggested that reference stations required a very high correlation (0.8 or above) for use in daily adjustment methods, the ACORN-SAT evaluation found that correlations of 0.6 or above produced satisfactory results. The reasons for this difference are yet to be fully evaluated. Two potential explanations are that ACORN-SAT uses multiple reference stations (normally 10) whilst the other methods were evaluated using a single reference station, and that the ACORN-SAT evaluation took place using real data, whilst the other evaluations used artificial data sets whose properties may not necessarily match real-world inhomogeneities. This result is critical for the use of such methods in Australia, where observing networks are sparse in many areas by the standards of developed countries; had a minimum correlation of 0.8 been necessary, many locations would have had no available reference stations.
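The screening step might be sketched as below. Again this is an illustration, not the Bureau's implementation: the function names, the correlation on first differences (a common way to damp the shared seasonal cycle), and the tie to the thresholds quoted above are assumptions for the example.

```python
import numpy as np

def select_references(candidate, neighbours, min_corr=0.6, max_refs=10):
    """Return the names of up to `max_refs` neighbour series whose
    correlation with the candidate meets `min_corr`, best first.

    Correlations are computed on day-to-day first differences so that
    the seasonal cycle shared by all stations does not inflate them."""
    cd = np.diff(candidate)
    scored = []
    for name, series in neighbours.items():
        r = np.corrcoef(cd, np.diff(series))[0, 1]
        if r >= min_corr:
            scored.append((r, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:max_refs]]
```

With `min_corr=0.8` a sparse network might return an empty list for many candidates, which is the practical problem the 0.6 finding resolves.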

Documentation of the data set at all points was considered to be critical. The homogenised and raw daily data, all of the transfer functions used in the adjustments, and other relevant documentation are available on the ACORN-SAT website. This information would allow the data set to be reproduced from the raw data should anyone wish to do so.

The methods used in the development of the data set have recently been published in a paper in the International Journal of Climatology. A more extensive technical report has also been published by the Centre for Australian Weather and Climate Research and is available on the ACORN-SAT website. Other material published on the website includes a second technical report, which compares outcomes from the new data set with other national and international data sets, and a station catalogue containing metadata for the 112 locations.

(One of the author’s personal goals is to visit all 112 locations; the 91st, Cape Moreton in Queensland, was visited in April. Occasionally his ambitions have exceeded his vehicle’s capabilities, as when he drowned his last car while attempting to get out of a remote site in the far north of Western Australia.)

The new data set will be used for operational climate analysis by the Australian Bureau of Meteorology, including reporting of annual national temperature anomalies. (Preliminary analyses indicate that the warming trend in the new data set, about 0.9°C over the 1910-2011 period, is similar to that in the data set previously in use). It will also allow, for the first time, analyses of century-scale changes in Australian temperature extremes. Such analyses are expected to be released over the next few months. 


Della-Marta, P.M. and Wanner, H. 2006. A method of homogenizing the extremes and mean of daily temperature measurements. J. Climate, 19, 4179-4197.

Trewin, B.C. 2012. A daily homogenised temperature data set for Australia. Int. J. Climatology, published online 13 June 2012.

Trewin, B.C. 2012. Techniques used in the development of the ACORN-SAT dataset. CAWCR Technical Report 49, Centre for Australian Weather and Climate Research, Melbourne.

Thursday, April 12, 2012

Request for help: Identifying, prioritizing and digitizing international hard copy holdings held at NOAA NCDC

Update 4/26: The Google Docs shared spreadsheet linked below has been updated and simplified, which will hopefully make this easier for people to engage with. There are also plans to host these images online soon and hopefully allow anyone interested to digitize the records and submit them for inclusion in the data holdings.

Colleagues at NCDC have recently embarked on a project of truly epic proportions: to inventory, image as necessary, and eventually digitize (to the extent useful unique information exists in them) the large volume of international holdings (>2000 boxes) held in hard copy in the NCDC basement. That is a lot (an awful lot) of boxes ...
These consist of a huge range of different, primarily land in-situ, meteorological holdings. Some may be unique, others may exist elsewhere already as images or have been digitized already. Below are just a couple of teaser images ...
We would value your and others' collective help in prioritizing the imaging and digitization of these holdings, in telling us what has already been done, and in actually doing some of the work.

So, with that ...

An editable form of the current version of the spreadsheet summary with about 15% inventoried (Africa and S. America largely) and 1% imaged (highlighted yellow) is available at:

Please edit it directly; I do not want to have to manage 50 versions of the same spreadsheet or merge them ...! Alternatively, you can leave a comment below if you are more comfortable doing that.

Edits can request further forensics (which stations, when, what), give reasons for interest in the data, offer to digitize images from that set of data etc. etc.

In terms of next steps, in the coming weeks the land data images will very likely start to be hosted on the International Surface Temperature Initiative databank ftp site at NCDC, in the appropriate stage 0 (raw data imagery) directories. Then, ideally, we will get some help in digitizing these, at which point we can start to make the digital data available without restriction through the databank portal and pull it through to NCDC's products, as well as allowing others to use and investigate it. This will potentially help us fill significant gaps in our knowledge of climate change in many regions and periods.

If there is significant interest I will update the spreadsheet with new inventories of boxes periodically.

Some resources which might help in this task should you wish to partake in it:

http://docs.lib.noaa.gov/rescue/data_rescue_home.html - current NOAA foreign data library imagery
ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/ - ever growing resource of digital data for land stations - feel free to play ...
ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage2/INVENTORY_ALL_monthly_stage2  - list of current stations and periods of record (incl. lots of duplicates)
More on the databank effort, including submission guidance for digital holdings that may not already be there, can be found at http://www.surfacetemperatures.org/databank . We are close to releasing a first version of the databank, but it is not too late to receive data submissions for consideration in the first version ...

Many thanks in advance for any help received in this task.