Data Documentation Template

Contents

1. Summary

2. Detailed Data Description

3. Data Access and Tools

4. Data Acquisition and Processing

5. Data Quality, Errors, and Usage Guidance

6. References and Related Publications

7. Historical Information

8. User Feedback

The documentation (which includes all metadata) that accompanies each project data set is as important as the data itself. Complete documentation is necessary to ensure appropriate data use and long-term stewardship of the data. The IPY data policy cites the Open Archival Information System (OAIS) Reference Model in defining complete documentation as "all the information necessary for data to be independently understood by users and to ensure proper stewardship of the data." The formally structured metadata required in the IPY metadata profile are minimal. Much more information is necessary to ensure data are "independently understandable," especially given the broad interdisciplinary use of IPY data. Several metadata standards are much more comprehensive and may be sufficient if applied in enough detail. Notable examples include the Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) (FGDC-STD-001-1998) with Remote Sensing Extensions (RSE) (FGDC-STD-012-2002) and the ISO 19115 and related standards.

Given the interdisciplinary breadth of IPY, it is not practical to require that all projects use the same comprehensive metadata standard. Instead, this document outlines a template for the elements that should be included as part of complete data set documentation. The template is built from existing documentation templates used by the National Snow and Ice Data Center and the Earth Observing Laboratory, with some consideration of the OAIS Reference Model and recommendations in the Global Change Science Requirements for Long Term Archiving Report. This is a first draft. I request comments and additions.

The data set documentation should accompany all data set submissions and contain the information listed in the outline below. While it will not be appropriate for each and every data set to have information in each documentation category, the following outline (and content) should be adhered to as closely as possible to make the documentation consistent across all data sets. It is also recommended that a documentation file submission accompany each preliminary and final data set.

Development of the documentation will need input from both investigators and data managers.

1. Summary

1.1 Citation

We strongly encourage users to cite the researchers who developed the data set. Different publications will require different styles of citation, but providing an example can encourage and help users to formally cite data. Examples:

Oberbauer, S. 2000. Ecosystem carbon fluxes, Toolik Lake, Alaska 1995. Boulder, Colorado USA: National Snow and Ice Data Center. Digital media.

Arctic Climatology Project. 2000. Environmental Working Group Arctic meteorology and climate atlas. Edited by F. Fetterer and V. Radionov. Boulder, Colorado USA: National Snow and Ice Data Center. CD-ROM.

1.2 Summary

Provide users with enough information to determine the usefulness of the data set.

Start with a topic sentence describing what information is in the data set (sea surface temperature, brightness temperature, snow cover, etc.). Include parameters, location, and temporal coverage in the first few sentences so users can see at a glance what the data set contains.

Include brief statements of the following important information:

  • Parameters, product type (in situ instruments and sensor data record), and sensor/source information.
  • Temporal and geographical coverage and any existing large gaps in data set coverage.
  • Units and resolution.
  • Data processing level/sampling (gridded, binned, swath) and any other parameters.
  • Ordering information (standard link to USO or FTP site).
  • Data format, data set size, and product media.
  • Brief discussion of ancillary data sets needed for processing.
  • Read software and analytical tools (if available; there is no need for great detail in the summary, just mention that they exist).

1.3 Usage Guidance

Briefly describe what kind of applications are suitable for the data. Describe the original intent of the data collection and potentially broader applications.

1.4 Acknowledgements

The purpose of this section is to acknowledge all major participants in the data collection and assembly process, who might not be covered in the citation. It can also be used to credit funding agencies. Example:

The Polar Pathfinder Sampler CD-ROM was produced by the National Snow and Ice Data Center DAAC under NASA Contract No. NAS5-98070. Pathfinder data were produced under separate grants awarded by the NOAA/NASA Pathfinder Program to the individual investigators listed below...

2. Detailed Data Description

2.1 Introduction

Introduce the data set and provide an overview of the contents, background, potential applications, and other general information. This should provide more detailed context than Section 1.2 Summary.

2.2 Parameter or Variable

List the scientific variables measured in the data set, along with units of measure for each. Example:

Parameters include soil depth (cm), soil temperature (°C), and radiative flux (W/m²).

You may even choose to have a single table here that gives parameter names, units, ranges, and sample values, instead of using some of the subheadings below.

Parameter Description

Definition and units of scientific variables in the data set.

Parameter Range

The range of data values that exist for the data. Include a list of valid values for codes that indicate missing values, quality flags, errors, etc.
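
As a minimal sketch of how the documented range might be derived, the following Python snippet computes the observed minimum and maximum while skipping missing-value codes. The function name observed_range, the sample values, and the code -9999 are assumptions chosen for illustration; they do not come from any particular data set.

```python
def observed_range(values, missing_codes):
    """Return (min, max) of the values, ignoring documented missing-value codes."""
    valid = [v for v in values if v not in missing_codes]
    if not valid:
        return None  # every value was a missing-value code
    return min(valid), max(valid)


# Hypothetical soil temperatures (°C); -9999 is an assumed missing-value code.
soil_temperature = [-4.2, -9999, 1.5, 3.8, -9999, 0.0]
print(observed_range(soil_temperature, missing_codes={-9999}))
```

Reporting the range without the missing-value codes keeps a sentinel such as -9999 from being mistaken for a physical minimum.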

2.3 Data Coverage, Representation, and Resolution

Temporal Coverage

The period of time that the data collection covered, more or less continuously. List any temporal data gaps. Indicate whether the data collection is ongoing ("to present"). A figure showing gaps for various parameters can be useful.
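
One possible way to find the temporal gaps to list here is sketched below in Python. It assumes that observation dates are available as a list and that the expected spacing is known; the function name find_gaps and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def find_gaps(dates, expected_step=timedelta(days=1)):
    """Return (last_before_gap, first_after_gap) pairs where the spacing
    between consecutive observations exceeds the expected step."""
    ordered = sorted(set(dates))
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected_step
    ]


# Hypothetical daily observation dates with two days missing.
observed = [date(1995, 10, 1), date(1995, 10, 2), date(1995, 10, 5), date(1995, 10, 6)]
for earlier, later in find_gaps(observed):
    print(f"Gap between {earlier} and {later}")
```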

Temporal Resolution

Describe the optimum and typical intervals between measurements during the periods of data collection. This can be the sampling frequency for an instrument and the intervals between measurement periods. It can also be the length of time it takes to collect an entire sample or scan.

Example for a remote sensing data set:

Each swath spans approximately 50 minutes. The data sampling interval is 2.6 msec for each 1.5-sec scan period for the 6.9 GHz to 36.5 GHz channels, and 1.3 msec for the 89.0 GHz channel. AMSR-E collects 243 data points per scan for the 6.9 GHz to 36.5 GHz channels, and 486 data points for the 89.0 GHz channel.

Example for a basic in-situ data set:

Data were collected once per day from October to December 1995.

Spatial Coverage

Provide more information than the four bounding coordinates in the metadata. Describe the coordinates of individual files, granules, or polygons. An image or map showing the coverage may be useful. Where appropriate include official and local place names and geographic context.

Spatial Resolution

For gridded data, describe both the grid size and the actual resolution, plus any resampling performed. Describe any interpolation schemes. Example:

Input 6.9 GHz data, corresponding to a 56 km mean spatial resolution, are resampled to a global cylindrical 25 km Equal-Area Scalable Earth Grid (EASE-Grid) cell spacing.

For vector data [xxx]

In the case of data collected in the field, you can list the sampling interval if you know it. Example:

Data were sampled every 1.5 meters along the transect.

Projection

Describe the projection, ellipsoid, and Coordinate Reference System used in processing the data. [xxx]

Grid Description

Describe the method and procedure for gridding and/or binning the data (for gridded data sets). Give the dimensions of the grid and the locations of the corner points.

2.4 Format

Describe the format, structure, and dimensions of the data in detail. Example:

The number of columns and records of ASCII data with an explanation of each column heading, and the units of the data. Indicate the format used for storing the data (compressed or uncompressed).

A 316 x 348 array of 4-byte long-word integers.

Sample data or images are useful.

Describing gridded binary data

The following information is particularly helpful to users of flat-binary data. Try to add as much as possible, if you can confirm it.

  1. The number of rows and columns.
  2. Whether the data are ordered by row or by column.
  3. Whether or not there is a header. If so, what is its size?
  4. What is the data type? Some options include:
    • Byte (8 bits)
    • Short Integer (2 bytes or 16 bits)
    • Long Integer (4 bytes or 32 bits)
    • Double Floating-Point (8 bytes or 64 bits)
  5. What is the byte order? This only applies to multiple-byte data, not single byte data. There are only two options (with various synonyms):

    • little endian (do not use terms such as PC byte order, least significant byte (LSB) order, or "host" byte order)
    • big endian (do not use terms such as most significant byte (MSB) order or "network" byte order)

    Preferred method of stating byte order: "These data are in little endian byte order." Or vice versa. Have a scientific programmer confirm the byte order information.

  6. If the data include more than one band of information within a single file, you will have to also state how these bands are interleaved within the file. There are three options:

    • BSQ = band sequential
    • BIL = band interleaved by line
    • BIP = band interleaved by pixel

    This information is unnecessary for data with only one band.
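
To illustrate why each item in the list above matters, here is a minimal Python sketch that reads a single-band flat-binary grid using only the standard library, and computes where a value sits in a multi-band file. The function names, parameter names, and defaults are assumptions made for this example; the row ordering, header size, data type, and byte order must come from the data set's own documentation.

```python
import struct


def read_flat_grid(path, rows, cols, header_bytes=0, byte_order="<", type_code="i"):
    """Read a single-band, row-ordered flat-binary grid into a list of rows.

    byte_order: "<" for little endian, ">" for big endian.
    type_code:  struct format character, e.g. "B" (byte), "h" (2-byte integer),
                "i" (4-byte integer), "d" (8-byte floating point).
    """
    fmt = f"{byte_order}{rows * cols}{type_code}"
    expected = struct.calcsize(fmt)
    with open(path, "rb") as f:
        f.seek(header_bytes)  # skip the header, if any
        payload = f.read(expected)
    if len(payload) != expected:
        raise ValueError("file is shorter than the documented dimensions imply")
    flat = struct.unpack(fmt, payload)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def element_index(band, row, col, bands, rows, cols, interleave):
    """Return the position (counted in elements, not bytes) of one value
    in a multi-band file, for BSQ, BIL, or BIP interleaving."""
    if interleave == "BSQ":
        return (band * rows + row) * cols + col
    if interleave == "BIL":
        return (row * bands + band) * cols + col
    if interleave == "BIP":
        return (row * cols + col) * bands + band
    raise ValueError(f"unknown interleave: {interleave}")
```

For the 316 x 348 array of 4-byte integers mentioned earlier, a call might look like read_flat_grid("grid.bin", rows=316, cols=348), where grid.bin is a hypothetical file name and the little endian default is an assumption. If any of the documented values is wrong, the file may still read without error but yield meaningless numbers, which is why confirming them is worthwhile.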

2.5 File and Directory Structure

Explain how the data are organized. List directories and subdirectories. If data files are provided in zipped files, explain the contents of the zipped file, especially when the zip file contains multiple directories.

File Naming Convention

Explain the file naming convention in detail. Example:

File names for F13 SSM/I data use the convention ssmi-f13-gggyyyymmdd.ccc where:

  • ggg = grid (n3a, n3b, s3a, s3b)
  • yyyy = year
  • mm = month
  • dd = day
  • ccc = product channel (19v, 19h, 22v, 37v, 37h, 85v, 85h)
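
A clearly documented convention lets users parse file names programmatically. The following Python sketch does so for the example convention above; the function name parse_file_name and the sample file name are hypothetical and are built only from the pattern and the listed values.

```python
import re
from datetime import date

# Pattern built from the convention described above: ssmi-f13-gggyyyymmdd.ccc
FILE_NAME_PATTERN = re.compile(
    r"^ssmi-f13-"
    r"(?P<grid>n3a|n3b|s3a|s3b)"
    r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
    r"\.(?P<channel>19v|19h|22v|37v|37h|85v|85h)$"
)


def parse_file_name(name):
    """Split a file name into grid, date, and channel; raise ValueError if it
    does not follow the convention (including impossible calendar dates)."""
    match = FILE_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    observed = date(int(match["year"]), int(match["month"]), int(match["day"]))
    return {"grid": match["grid"], "date": observed, "channel": match["channel"]}


# Hypothetical file name constructed from the convention.
print(parse_file_name("ssmi-f13-n3a19950704.19v"))
```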

File Size

Specify the size of individual files, or provide a range of sizes if you have many files. Include a total size for the entire data set. Example:

File sizes range from 50 KB to 2.1 MB. There are 212 files, totaling 287 MB.
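
Figures like these can be gathered with a short script. This Python sketch assumes the data set sits under a single directory; the function name summarize_file_sizes is a hypothetical choice for this example.

```python
from pathlib import Path


def summarize_file_sizes(directory):
    """Return (file_count, smallest, largest, total) in bytes for all files
    under the directory, including subdirectories."""
    sizes = [p.stat().st_size for p in Path(directory).rglob("*") if p.is_file()]
    if not sizes:
        return 0, 0, 0, 0
    return len(sizes), min(sizes), max(sizes), sum(sizes)
```

The results are in bytes, so state which convention you use when converting to KB or MB (1000 or 1024 bytes per KB).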

Fixity Information

Describe any authentication mechanisms and authentication keys used to ensure that the data have not changed in an undocumented manner. Examples of fixity mechanisms include checksums, message digests, and digital signatures. The Message-Digest algorithm 5 (MD5) is a common approach.

Sample Data Record or Browse Image

Show a sample data record. For an ASCII file, explain the columns.

You may have sample images for binary data. If your data set contains browse images, show a few here. Create thumbnails that link to larger images.

For gridded data, you can show images derived from data, but be sure to explain that these are not representative samples of the actual data.

3. Data Access and Tools

3.1 Data Access

Describe how a user could obtain the data. The specific link or program call should be included in the metadata, but this is a place to provide more details and explain alternatives. It is especially important for non-digital data.

3.2 Volume

Specify the total volume of the product. The purpose of this field is to help users decide whether and how they could transfer the entire data set, and to tell them how much storage space is required to hold it.

3.3 Software and Tools

Describe software that is available for working with this data set. Provide references and URLs to sites where the tools are fully described, if possible. Try to include open source software.

4. Data Acquisition and Processing

Describe methods of data collection and processing, potentially including instrument descriptions, sampling strategies, laboratory methods, processing steps, calculated variables, theory of measurements, data sources, and/or any other appropriate information. Cite all relevant literature and include it in the "References" section.

Not all of the subheadings in this section may be needed. Add, delete, or combine them as makes sense for the data set.

Theory of Measurements

Describe the theoretical basis for the way in which the measurements were made for all data used in creating this data set.

Sensor or Instrument Description

Describe the instrument(s) used to collect data. Include links to technical specifications.

Data Acquisition Methods

Describe the procedures for acquiring this data in sufficient detail so that someone else with similar equipment could duplicate the measurements. Note that this is the procedure by which the data were acquired (either collected or where the Principal Investigator got it). It is not the procedure by which the data were processed or computed from the originally obtained data.

If there is relevant calculation information, it goes into 'Processing Steps' and that section is referenced here. For higher level data products, this section should refer to the group or persons from whom the PI obtained the data. Reference any lower level document that describes the collection and processing of the data the PI acquired to produce the data set described here. If no lower level description exists, describe the method by which the original data were acquired, unless it is a routine product acquired from a commercial or government agency (e.g., a USGS map).

Data Source

For derived or value-added products, cite the original data source(s).

Derivation Techniques and Algorithms

Describe any special techniques or algorithms used. This section contains detailed descriptions of, and references on, models and derivation techniques. General statements go into the 'Theory of Measurements' section.

Processing Steps

Indicate the sequence of processing steps that the investigator applied to the data. If the data are processed internally to the instrumentation, you do not need to describe that processing in great detail here. This section should concentrate on the processing that is actually done by the investigator.

5. Data Quality, Errors, and Usage Guidance

5.1 Assumptions and Data Uncertainty

This relates to the theory of measurements section above, but it is wise to clearly state the common assumptions that experts may take for granted but that may not be apparent to data users from a different discipline. Similarly, it is important to describe the general uncertainties around the data that may not be readily recognized by non-experts. This could be a description of the error bars around derived measurements, limits in applied algorithms and theory, uncertainties in source data, etc.

5.2 Data Quality Assessment and Validation

Describe QA and QC processes for the data both during collection and analysis. This could include basic range checks, more comprehensive assessments, or elements of the collection protocol.
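
A basic range check of the kind mentioned above might look like the following Python sketch. The flag values, the plausible range, and the missing-value code are assumptions for illustration only; a real data set should document its own.

```python
GOOD, MISSING, OUT_OF_RANGE = 0, 1, 2  # assumed flag values, for illustration


def range_check(values, lower, upper, missing_code):
    """Assign one quality flag per value based on a plausible physical range."""
    flags = []
    for v in values:
        if v == missing_code:
            flags.append(MISSING)
        elif lower <= v <= upper:
            flags.append(GOOD)
        else:
            flags.append(OUT_OF_RANGE)
    return flags


# Hypothetical soil temperatures (°C) checked against an assumed range of -60 to 60.
print(range_check([-4.2, -9999, 85.0, 3.8], lower=-60, upper=60, missing_code=-9999))
```

Flagging suspect values rather than deleting them keeps the original measurements available and makes the QC decision visible to users.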

Describe any data validation studies.

5.3 Error Sources

Describe specific known errors in the data, how they are indicated, and how they are addressed in processing.

5.4 Usage Guidance

Provide appropriate caveats on the use of the data. Explain how the data should NOT be used as well as how they should be used. Provide an example of how to actually work with the data.

6. References and Related Publications

6.1 Related Data Collections

List and link to other related data sets.

6.2 Related Publications

Provide descriptions of and references and links to any relevant technical notes, publications, and agreements related to the data set and any source data.

6.3 References Cited

Provide references and links for any publications referenced in this document.

7. Historical Information

This is information that the data archive should maintain in order to understand how the data have evolved over time and to ensure that the data are well preserved.

7.1 Provenance

This section documents any changes to the data over time. Information should include:

  • The original provider and location of the data.
  • Any changes (with dates) in data ownership or rights, especially if formally documented.
  • Changes in access restrictions and dates.
  • Media migrations.
  • Error corrections, reprocessing, recalibration, etc. Major changes should probably lead to a new version of the data set.

7.2 Reference Information

Describe any historical references or nicknames for the data set. Describe how the data set relates to other data sets, especially other versions of the same data set. Describe the versioning scheme.

8. User Feedback

This section captures a log of user experience with the data. Changes to the data that result from this feedback should be noted here and documented in the Provenance section.