Tuesday, August 10, 2010

White paper 14 - solicitation of input

The solicitation of input from the community at large including non-climate fields and discussion of web presence white paper is now available for discussion. When making posts please remember to follow the house rules. Please also take time to read the full pdf before commenting and where possible refer to one or more of section titles, pages and line numbers to make it easy to cross-reference your comment with the document.

There are no specific recommendations associated with this white paper. It is more a discussion document of options.

8/25 Please associate comments with the appropriate white paper


  1. Why not start - not with wide ranging academic pontification - as in the current document - but by enumerating the open collaborative projects in climate science that already exist - and then developing a plan to contribute to these projects?

    One starting point would be opening up on all IPCC related communication which has been requested by those interested in understanding how IPCC report conclusions are reached.

    Another interesting starting point would be acquiescing to requests for scientific information to all requestors.

  2. Further discussion of the IPCC is off-topic and will not get posted. The same applies to discussion of historical requests for information. The remaining white papers make perfectly clear that openness and transparency are the aim from the outset. Comment retained solely for its first paragraph.

  3. I was surprised to see nothing in the white paper on open source communities and how they work. This includes very basic questions, such as: if the goal is to release as much of the code and data as possible, which license model should be used? Open source communities have experimented with a number of different open license models, each of which has different implications for how the software/data can be used and especially what constraints are placed on subsequent derived works.

    Perhaps a bigger question is what model will be used for accepting contributions to this code (and perhaps the data too, if a crowd-sourcing approach is used for digitization) from the broader community. The contributions white paper really only talks about comments and discussions, rather than code and data contributions. So we will need an appropriate governance model for balancing between control of the code development and inclusion of broader communities, while maintaining quality control.

    As far as I know, there are no precedents for this in the meteorological / climatology community. Even community modeling efforts such as NCAR's CCSM aren't truly open source: CCSM has never had to deal with contributions from outside of the climatology research community. Nick Barnes' effort is perhaps the closest, but (as far as I know) hasn't built enough critical mass of contributors yet to offer any useful lessons. So we'll probably need to look at models from open source software development (e.g. Linux, Apache, Eclipse, OpenOffice). Fogel's book (http://producingoss.com/) is a useful guide to how these communities work.

  4. hi Peter,

    One note:

    "(e.g. the blog for discussion has seen minimal use despite being flagged to numerous expert communities through multiple listservs)" (line 162)

    I've just made the "skeptical" community aware of your initiative. I only became aware of it today. I hope everyone stays constructive.

    I'll read through the materials and see if any of my experience is relevant and enlightening.

    I'll contribute my code and experience working with the data that is already available.


  5. "... of the public on a pro bono basis as well as scientific peers. However, a distinction needs to be made between the scientifically useful reproduction activity undertaken by e.g. clearclimatecode.org and the scientifically meaningless replication process of running the same code on my PC as you ran on yours which only proves I can replicate your mistakes on my PC." (line 312)

    This is by far the least interesting aspect of re-running the code as archived. A scientific paper is merely the "advertisement" that a particular piece of code has been run and has produced the outputs published. It is well known from empirical studies in reproducible research that researchers, with startling regularity, CANNOT reproduce the figures in their papers. Re-running the exact code on other systems ENSURES that the figures published are in fact produced by the exact code. Any machine dependencies become apparent. Further, as new data is released, the science becomes "recompilable". These benefits, recognized by other fields, should not be given short shrift in this document. Replication of the exact code IS THE SCIENCE. The scientific claim is that "this code" produced "that result". Further, access to the real code gives people the ability to modify it and improve it. I would suggest a GPL-like licence to ensure that people share back with the same rights.

  6. "... sampling errors due to environmental effects as well as historical images for the location. These pointers could also be linked to the source data records and with some relatively simply additional tools could allow time series visualisation for any site and subsequently global or grouped visualisation of any data set in a simple intuitive manner."

    One such source already exists: geonames.org. For some data points of interest (say, surface stations), a unique geoname can be identified for that location and logged with geonames.org.

  7. NOAA / Scripps data for atmospheric CO2 is distributed from the site I run at CO2Now.org, using website widgets. (See the CO2 widgets at http://www.co2now.org/Current-CO2/CO2-Widget/). Global temperature data was added recently, thanks to the NOAA State of the Climate global analysis.

    As a non-scientist, I see great value in presenting objective, planetary data in a context that makes the science understandable to ordinary people who want to know what is happening to their planet. In learning how to create this context, there is a key barrier that keeps getting in the way: even if you read enough papers to pinpoint a leading data set that is relevant to a particular issue, it can be hard to know for sure that you have the right one, and sometimes next to impossible to extract the data. It is rare to find primary data published in an Excel spreadsheet with a meaningful introduction that explains what is in the data set. It helps to have the data posted online, but more examples and explanations will be needed if it is going to be used by interested individuals. My suggestion: set up a web page for a few planetary health indicators. On that page, include links to the leading data source or sources and to the websites of the institutions that lead the research and monitoring for that indicator (e.g. sea ice extent, sea ice volume, sea level rise), plus info on how to see or use the data, how the data is collected, and what variables are of interest to scientists or analysts. It would also be helpful to link to scientific papers that relate to the leading data sets. Putting people in touch with that kind of information gives them a greater appreciation for climate science work, and it raises the level of trust when people can follow the information they are getting back to the authoritative sources of the data analysis. The NOAA State of the Climate global analysis is a good example. That is a big step forward. I also think a lot more can be done.

    Keep up the good work.

  8. There are also some interesting ideas out there on how to build communities around the creation of new visualizations of existing datasets. The best example I know of is ManyEyes (http://manyeyes.alphaworks.ibm.com/manyeyes/). We need to look at such efforts to see whether this kind of community visualization can be supported here.

  9. Well, this is a welcome effort. I haven't gone through the white papers in detail yet, so I will limit myself to general comments on this effort:

    1. There has to be an effort within this consortium to preserve and make available the raw data from all sources (satellite, ground temp stations etc.), even if the raw data won't make sense to most observers.

    2. Pls make available the metadata relevant to the raw data. It would be good to have any parsing tools for the raw data available as well.

    3. Preferably also provide stats computed from the raw data (only the raw data), as a means of contrasting with the stats available from adjusted or pre-processed data.

    4. Similarly, pls make available the adjusted/processed data, in addition to a clear description of the methodology used to adjust the raw data.

    5. Pls make available the tools that are used to parse the adjusted data and the stats computed from the adjusted data.

    6. Pls post any literature or documents related to the justifications for continued adjustments to raw data, so that the reader understands the progression of adjustments made to the raw data.

  10. I'm surprised by Steve Mosher's remark that the "skeptical" community only just found out about this project. It hasn't been much of a secret. I found out about it in May, on William Connolley's blog (WMC was initially quite snarky: http://scienceblogs.com/stoat/2010/05/surfacetemperaturesorg.php). I immediately contacted Peter Thorne and (somewhat to my surprise) was invited. Then there was a flurry of mentions on blogs and email at the end of July.

  11. Just a quick comment:

    Has the Met Office considered publishing data as RDF and linked data? RDF/linked data could help to standardise, structure and interlink all meteorological/climate data across the web. data.gov and data.gov.uk are both planning to release as much data as possible as RDF and various scientific communities are finding RDF/linked data/semantic web technologies to be a great help when it comes to publishing and linking vast, complex data sets (e.g. http://esw.w3.org/Semantic_Bioinformatics).

    True, semantic web technologies are yet to gain critical mass but great leaps have been made recently.

    If you're unsure of what the "semantic web" is, then Sir Tim Berners-Lee offers a good intro in a 16-minute TED talk video here: http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html

  12. Lines 311-313 seem to miss the point that somebody re-running exactly the same software might find, and correct, the mistake.
    One general point: as the aims of this project include data integrity and longevity, care should be taken not to adopt storage/access technology that may quickly become unsupported (see http://www.dcc.ac.uk/).

    Andy Horseman & Rob MacKenzie
    Lancaster Environment Centre UK

  13. On software and code (with respect to lines 301 and following).

    Broadly software needs a much higher prominence in this paper. It’s barely mentioned. The key positions that need discussing are “Whenever data is published, publish code also” and “no graphs without code”. The framing is wrong. This, and other white papers, dismiss source codes almost as an irrelevance. The issue is not “can we possibly use Subversion (SVN) to make available our codes”, but “why are scientists not routinely publishing all of their software”.

    To some extent Open Source is a misdirection. I accept that there is some part of the community that wants to republish modified versions of computer codes, but they are in the minority. Most of the community that is asking for source code simply wants to inspect it, run it, and possibly run modified versions for personal use. That would be possible under a wide range of software licenses: Climate Science need not go Open Source for that to happen. Though of course, I would recommend an Open Source approach (and in particular a cc-by license).

    To some extent this ties into...

    Line 129 “employer asserts IPR”: the community needs to define the extent to which it is acceptable to have control of data and algorithms in the private sphere. Ideology can inform this debate but pragmatic decisions have to be taken. Few people would argue that source code for commodity tools such as an oscilloscope, a GCMS, or an MMTS sensor needs to be made available; but in the case of a unique space borne sensor, perhaps making source code and other engineering documents available would make sense. Contracting practices need to be examined as well: when a contract is made for the supply of (bespoke) software, the default should be that the software will be owned by the purchasing organisation and that it will be published as open source.

  14. Data access and accessibility

    Lines 291 and following.

    “it seems reasonable that any user of the data archive should be required to register and agree ...”.

    Our experience with ccc-gistemp is that access to data without registration is a clear benefit. ccc-gistemp (a reimplementation of the GISS GISTEMP algorithm) is designed to be downloaded and run by any interested member of the public. When someone runs it, the first thing it does is download its input data. This would be unlikely to work if registration were required. In general, requiring any sort of registration or agreement makes it much harder to bulk process data automatically, and precludes it in many cases.

    Following: ... “ideally making use of graphical interfaces”

    Whilst I applaud and appreciate a move towards browsable interfaces (ideally using the web browser as the interface!), there is also a need for “machine-readable” interfaces. “Bulk download” should be possible programmatically as well as “by hand”. Again ccc-gistemp has useful input here: generally, HTTP is to be preferred for publishing data (over FTP, say). Filenames (URLs in other words) that do not change are preferred over ones that do. Ideally machine-readable catalogues would be made available (a JSON API would be good). Transmission of (certain, simple) subsets of data would be extremely useful as well. Example: from GHCN (Global Historical Climatology Network) it would be nice to be able to download the data for a single station without downloading the entire several-megabyte dataset; since this corresponds to a single segment of the entire file, this could even be arranged by advertising the file offsets and using HTTP to retrieve a segment (no server-side programming necessary).
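
The single-station idea above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the catalogue format, the station ids, the offsets and the function names are all hypothetical and do not describe any existing GHCN or ccc-gistemp interface. The only standard piece is the HTTP `Range` header, which a static file server may honour with a 206 Partial Content response.

```python
import json
import urllib.request

# Hypothetical machine-readable catalogue: station id -> byte offset and
# length of that station's segment within the bulk data file.
# The ids and numbers are made up for illustration.
EXAMPLE_CATALOGUE = json.loads("""
{
  "STATION_A": {"offset": 0, "length": 1200},
  "STATION_B": {"offset": 1200, "length": 840}
}
""")


def range_header(offset, length):
    """Build an HTTP Range header value for `length` bytes from `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends.
    return "bytes={}-{}".format(offset, offset + length - 1)


def build_station_request(url, catalogue, station_id):
    """Prepare a request for one station's segment of the bulk file."""
    entry = catalogue[station_id]
    request = urllib.request.Request(url)
    request.add_header("Range", range_header(entry["offset"], entry["length"]))
    return request


def fetch_station(url, catalogue, station_id):
    """Download one station's bytes; needs a server that supports ranges."""
    request = build_station_request(url, catalogue, station_id)
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the server honoured the Range header.
        if response.status != 206:
            raise RuntimeError("server did not return a partial response")
        return response.read()
```

With a catalogue like this published next to the data file at a stable URL, a client would call `fetch_station(url, catalogue, "STATION_B")` and receive only that station's bytes, with no registration step and no server-side code beyond ordinary static file serving.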