Tuesday, July 27, 2010

White paper 6 - Data provenance, version control, configuration management

The data provenance, version control and configuration management white paper is now available for discussion. When making posts, please remember to follow the house rules. Please also take time to read the full PDF before commenting and, where possible, refer to one or more of section titles, pages and line numbers to make it easy to cross-reference your comment with the document.

The recommendations are reproduced below:
• As an outcome of the workshop, there should be a clear definition of primary (Level 0) and secondary (Level 1) source databases across the spectrum of observing systems which may contribute data to the land surface temperature database.

• We should establish a coordinated international search and rescue of Level 0, primary-source climate data and metadata both documentary and electronic (see wp3.) This effort would recognize and support similar on-going national projects. Once located, the project should (a) provide, if necessary, a secure storage facility for these documents or hard-copies of same, (b) create, where appropriate, digital images of the documents for the archive for traceability and authenticity requirements, (c) key documentary information into digital files (native format in Level 1 and uniform format in Level 2), (d) archive, test and quality-assure raw data files, technical manuals and conversion algorithms which are necessary to understand how the geophysical variable may be unpacked and generated from electronic instrumentation, and (e) securely archive the files for public access and use.

• A certification panel will be selected to rate the authenticity of source material as to its relation to the “primary-source”, i.e. to certify a level of confidence that the Level 1 data, as archived, represents the original values from the Level 0 primary source. The process will often be dynamic, since we anticipate that new information will always become available to confirm or cast doubt on the current authenticity rating.

• Given the extent of this project and the unpredictable nature of the evolution of the archive, the reliance on an active panel to address version-control issues as they arise will be necessary. The panel will investigate the possibility of utilizing commercial off-the-shelf or open-source version control software for electronic files and software code (e.g. Subversion, http://subversion.apache.org/).

• Since one requirement of this project is to preserve older versions of the archive, and since a considerable amount of tedious research will be performed on any one version, it is generally assumed that up-versioning of the basic Level 2 digital archive will be performed as sparingly as possible.

• The algorithms that produce the datasets used for testing and the datasets themselves must be documented and version-controlled.

• A configuration management board will be selected to initially define the necessary infrastructure, formats and other aspects of archive practices. A permanent board will then be selected to oversee the operation. This board and the version-control panel may be coincident or at least overlapping in membership.


  1. On configuration management, there's not enough discussion of the question of what constitutes a configuration item. For example, what's the appropriate unit for checking into version control? Standard solutions to version control for databases place the schema, the data and sometimes the views under version control. But the data set we're talking about here is big enough that we're likely to want to break it down into smaller pieces for version control purposes.

    The hard problem here, I think, will be handling updates that affect not just single data points but entire swaths of data. For example, a minor change in an averaging procedure could alter every data point in a derived data product. Regular branch and merge procedures for version control won't be any use for this, which may mean that standard database tools aren't sufficient. So we must consider the need for novel research into version management strategies for numerical data, strategies that take into account the relationship between the different level products.
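
One way to picture the concern in the comment above is to identify each derived product by everything that went into it, so that a change to an averaging procedure visibly changes the identity of every product downstream. The sketch below is only an illustration under stated assumptions: the function names, the code-version labels (`averaging-1.0`, `gridding-1.0`) and the parameters are hypothetical and do not come from the white paper.

```python
import hashlib
import json


def version_id(kind, payload):
    """Return a short, deterministic identifier for a configuration item."""
    blob = json.dumps({"kind": kind, "payload": payload}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


def derived_id(source_ids, code_version, params):
    """Identify a derived product by its sources, its code and its parameters."""
    return version_id(
        "derived",
        {"sources": sorted(source_ids), "code": code_version, "params": params},
    )


# Hypothetical source files, identified here by placeholder content.
station_a = version_id("source", "station A, native format")
station_b = version_id("source", "station B, native format")

# A derived product, and a second product built on top of it.
monthly_v1 = derived_id([station_a, station_b], "averaging-1.0", {"window_days": 30})
gridded_v1 = derived_id([monthly_v1], "gridding-1.0", {"cell_deg": 5})

# A minor change to the averaging code: the sources are untouched,
# yet every product downstream receives a new identifier.
monthly_v2 = derived_id([station_a, station_b], "averaging-1.1", {"window_days": 30})
gridded_v2 = derived_id([monthly_v2], "gridding-1.0", {"cell_deg": 5})

print(monthly_v1 != monthly_v2, gridded_v1 != gridded_v2)
```

Under this scheme the unit of versioning is the whole product together with its inputs, not individual data points, which sidesteps branch and merge for numerical data. Whether that granularity is right for this archive is exactly the open question the comment raises.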

  2. Another point - I think the discussion in the paper might be underestimating the need for versioning of higher level data products. Just as most software developers underestimate the scale of the ongoing evolution of a "finished" piece of software, so there's a risk here of underestimating the need to accommodate an ongoing series of fixes to a well established data product.

  3. Oh, and we need to beef up the discussion on provenance. E.g. can we capture not just what processing steps have been carried out on various higher level products, but also by whom, when and for what purpose?
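
As a rough illustration of the question in the comment above, a provenance log entry could carry the "by whom, when and for what purpose" fields alongside the processing step itself. The schema, field names and example values below are assumptions made for this sketch, not something defined in the white paper.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class ProcessingStep:
    """One entry in the provenance log of a data product (hypothetical schema)."""
    product: str        # which product was produced
    operation: str      # what processing step was carried out
    software: str       # which code version carried it out
    performed_by: str   # by whom
    performed_at: str   # when (ISO 8601, UTC)
    purpose: str        # for what purpose


def record_step(log, product, operation, software, performed_by, purpose, when=None):
    """Append a step to the log, stamping it with the current UTC time by default."""
    when = when or datetime.now(timezone.utc)
    step = ProcessingStep(
        product, operation, software, performed_by, when.isoformat(), purpose
    )
    log.append(step)
    return step


log = []
record_step(
    log,
    product="monthly-means-r2",            # made-up product name
    operation="recompute monthly means",
    software="averaging-1.1",              # made-up code version
    performed_by="j.smith",                # made-up user
    purpose="fix handling of missing days",
    when=datetime(2010, 7, 27, 12, 0, tzinfo=timezone.utc),
)

print(json.dumps([asdict(step) for step in log], indent=2))
```

Making the record immutable (`frozen=True`) and append-only is a design choice for the sketch: a provenance trail is more useful if past entries cannot be edited quietly.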

  4. I think it would be worthwhile to make explicit the management control systems & protocols that are intended for this project. I emphasize specifically control systems with respect to segregation of duties.

    As an example from the private sector, to quote the link provided:

    Segregation of duties is used to ensure that errors or irregularities are prevented or detected on a timely basis by employees in the normal course of business.
    Segregation of duties:
    - A deliberate fraud is more difficult because it requires collusion of two or more persons, and
    - It is much more likely that innocent errors will be found.
    At the most basic level, it means that no single individual should have control over two or more phases of a transaction or operation. Management should assign responsibilities to ensure a crosscheck of duties.
    Some conflicting duties are:
    • Creating vendor and initiate payment to him.
    • Creating invoices and modifying them.
    • Processing inventory, and posting payment.
    • Receiving Checks and writing pay-offs.

    A parallel conflicting duty in the Surface Temperature project might include:
    - selecting which weather stations to include in a data set, and interpolating (or homogenizing) other data points from this set.

    The example I provided may or may not be apt. But I am sure that in a project of this size and scope there is a place for formal risk analysis and the creation of control protocols to mitigate this risk.

    This may be obvious and intended already. If so, thanks in advance. Even if so, I think it is important that the intended risk assessment and controls are made explicit and communicated to all stakeholders.

    Finally, please indulge the link; while not strictly scientific, it is appropriate from an organizational / project governance perspective.


    Thanks for your attention and best of luck with the project.


  5. This paper discusses the problem of identifying what constitutes a configuration item and how often revisions should be made. It is difficult to see how this can be resolved until it is determined exactly of what the products will consist.
    The focus on openness and transparency is important, not least to reduce challenges to the data, but this further implies that revision control and the audit trail should be set up so that they can easily be continued by the end user. That is, a published paper should state the revision, not just the source of the data used, and the data extracted from the database should be packaged to help that end user continue traceability.
    If data is to be supplied on a file basis then traceability is relatively easy. If data is supplied in a way similar to the Data Extractor of the British Atmospheric Data Centre (BADC), for example, then control becomes more difficult. Allowing the user to extract bespoke subsets of the data is obviously convenient for the user, but if traceability and openness are to be the key, then the trade-off between convenience and control must be considered carefully.
    Any processing of data to create higher-level products should be managed using a form of software quality control (SQC), and the source code included in the revision control system. If the source code cannot be made available, because of copyright for example, then the quality of the derived data product should be considered to be downgraded. We have discussed simple ways of preserving traceability for model output from smaller projects, including control of visualisation software used to process and present data, in a recent paper (Horseman et al., Geosci. Model Devel., 3, 189-203, 2010).

    Andy Horseman & Rob MacKenzie
    Lancaster Environment Centre UK
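
To make the point about bespoke extracts concrete, one possible approach is to ship every extracted subset with a small manifest that records the archive revision, the query that produced it, and a checksum of the extracted bytes. This is a sketch only: the manifest fields, the `package_extract` and `verify_extract` helpers, the revision label and the sample rows are all invented for illustration and do not describe how the BADC Data Extractor works.

```python
import hashlib
import json


def package_extract(rows, revision, query):
    """Bundle an extracted subset with the facts needed to trace it later."""
    data = "\n".join(",".join(str(v) for v in row) for row in rows).encode("utf-8")
    manifest = {
        "archive_revision": revision,   # revision of the archive that was queried
        "query": query,                 # the selection the user asked for
        "row_count": len(rows),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    return data, manifest


def verify_extract(data, manifest):
    """Check that the data still matches the checksum in its manifest."""
    return hashlib.sha256(data).hexdigest() == manifest["sha256"]


# Made-up rows: (station id, date, temperature in degrees C).
rows = [("STN001", "2010-07-01", 18.4), ("STN001", "2010-07-02", 19.1)]
data, manifest = package_extract(
    rows,
    revision="r1234",                                    # made-up revision label
    query={"station": "STN001", "month": "2010-07"},
)

print(json.dumps(manifest, indent=2))
print(verify_extract(data, manifest))
```

A paper could then cite the manifest's revision and checksum, which is one way for the end user to continue the audit trail for a subset that never existed as a file in the archive.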

  6. This paper lies at the heart of the Climate Code Foundation's interest in the surface temperatures project. I have too much to say about it to condense into this comment box, but I will try.

    Firstly I would like to emphasize that version control and configuration management are not particularly difficult or arcane areas. They are routine for thousands of companies and organisations around the world, and the necessary tools are in daily use by many millions of people. The answers to many questions in this area are very well-known and understood in the software industry and computer science fields. Also, they do not require heavy-weight organisational structures or processes, or large-scale computing resources.

    Secondly, I would like to emphasize that without proper version control and configuration management, this project will not have any traceability. Traceability is emphasized over and over again throughout the white papers, and this is central to achieving it. If a dataset has been processed by any software, but does not include - in metadata - exact versions of the source datasets for that processing and of the processing software, and also configuration information of the processing system, then that dataset is not traceable.

    Thirdly, this white paper didn't say much about publication, and I think that should be mentioned, as it relates to version control issues. Ideally the databank(s) and version control repositories should all be widely accessible, such that any given version is immediately visible. This is the norm in open-source software projects (such as, for instance, the Firefox web browser, or the Linux kernel). There are many mature web-based VCS systems (e.g. SourceForge, GoogleCode, KnowledgeForge, GitHub) which allow this.

    Lastly, a few very specific points. Lines 204-206 recommend that the version control panel should "investigate the possibility" of utilizing COTS or open-source software. To a software professional, this is absolutely a no-brainer: version control software is absolutely essential. For a high-visibility project such as this, a mature off-the-shelf open-source VCS system such as Subversion should be used. *Do not*, under any circumstances, allow yourselves to be locked into a closed commercial VCS system. Many projects have failed and companies have gone bust because they have been let down by their VCS software or its supplier. (Note: not all commercial VCS systems are closed.)

    Line 211 (and elsewhere) uses the word "algorithm". This is a little obscure: what is meant here is (almost certainly) the software source code, which implements the algorithm. The word "algorithm" is often used in science and the software industry to mean a natural-language description of a process. Such descriptions will almost always miss out crucial details (e.g. data formats, precisions, operation orders, library choices) which are necessary for proper replication.

    I have so much more to write, which I hope to contribute at the workshop.
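
The remark above about operation order can be shown in a few lines. A natural-language description such as "add up the values" does not say in which order, and in floating-point arithmetic the order can change the result. The values below are arbitrary numbers chosen to expose the effect, not observations.

```python
import math

values = [0.1, 0.2, 0.3]

# The same "algorithm" (add three numbers), two operation orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# math.fsum tracks intermediate rounding error and returns a correctly rounded sum.
compensated = math.fsum(values)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left, compensated)
```

The difference here is in the last digit, but two implementations of the same described procedure can only be expected to agree bit for bit if the source code, and not just the description, is under version control.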

  7. I think, as GPCC does, we need three kinds of products: 1) near-real time, 2) fully used and 3) constant input. Each dataset will then be updated and/or upgraded. In my experience of developing daily gridded precipitation data over Asia, GTS-based data (GSOD of NCDC/NOAA) do not agree with off-line (NHS's) data. One big issue was the definition of "period-of-day". So, in the beginning stage, 1) and 2) should be treated separately.