QDataSet

From autoplot.org


The Definitive Specification to QDataSet.

Purpose: This document describes QDataSet as implemented for Autoplot, and should serve as the definition of the working model. Other documents should be considered specifications of a future version of QDataSet. This document will also attempt to version the spec of QDataSet, so that codes can describe compliance.

Contents

  1. Introduction
  2. QDataSet v1.0
  3. Schemes Introduced
  4. DataSet Properties
  5. Schemes
  6. Modification History
    1. 2009-May (v.1.05)
      1. Rank 0 Support introduced
    2. 2009-July (v1.10)
      1. Rank 4 Support introduced
      2. Bundle dimension added
      3. Bundle DataSets with DEPEND_0 in property(DEPEND_0,idx)
    3. 2009-Aug (v1.20)
      1. Rank 2 DEPEND_1
      2. CONTEXT_0 property
    4. 2009-Sep-28 (v1.30)
      1. Metadata Handling
      2. Schemes Used
    5. 2009-Nov-12
    6. 2010-Jun-24
    7. 2010-Sep-23
    8. 2011-Feb-14
    9. 2011-Feb-28
    10. 2011-Apr-13
    11. 2011-Jun-14
    12. 2011-Nov-18
    13. 2012-Feb-14

1. Introduction

IDL, Matlab, Java, C and Fortran all model numbers in code. They can store groups of numbers in arrays, but they do not try to model measurements of data. When building any analysis system, the developer must first create a model for storing data. This is typically done ad-hoc and often with minimal attention, leading to problems down the road.

QDataSet attempts to introduce a language-agnostic method for modeling measurements of data. Other similar systems exist, such as NetCDF and CDF, but they have components that tend to drag them into a particular domain. QDataSet tries to be as simple as possible, while still allowing complex operations to be done. Further, QDataSet is a simple interface, easily implemented in different environments, often with code that achieves high performance without losing abstraction.

This poster describes most cleanly the interface: 2011 Poster

Note: Autoplot uses an implementation of QDataSet, which may vary slightly from the specification here. For example, its UNITS property is implemented with Das2 units, where a QDataSet could call for a string.

2. QDataSet v1.0

This is the model as it exists in Autoplot. QDataSet is short for Quick Data Set, which uses a thin syntax layer and semantics to build abstraction. This allows QDataSet to be implemented in many languages such as Python, IDL, Fortran and C as well as in Java. (Note that only the Java implementation exists presently.)

DEPEND_0.  tags the zeroth dimension of the dataset.  rank 1.  slice0 removes this dataset.
DEPEND_<N>.  tags the Nth index (zero based).
PLANE_<N>.  attached planes.
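As a rough illustration of how thin this layer can be, the following Python sketch models a dataset as values plus a property map. The class name SimpleDataSet and its methods are hypothetical stand-ins chosen for this example, not the Java interface itself, and UNITS is shown as a plain string.

```python
class SimpleDataSet:
    """Hypothetical, minimal stand-in for a QDataSet (illustration only)."""

    def __init__(self, values, **properties):
        self._values = values              # scalar, list, or nested lists
        self._properties = dict(properties)

    def rank(self):
        """Number of indices needed to reach a single value."""
        r, v = 0, self._values
        while isinstance(v, list):
            r += 1
            v = v[0] if v else None
        return r

    def length(self, *indices):
        """Length of the dimension reached after the given leading indices."""
        v = self._values
        for i in indices:
            v = v[i]
        return len(v)

    def value(self, *indices):
        """Value at the given indices."""
        v = self._values
        for i in indices:
            v = v[i]
        return v

    def property(self, name):
        """Property value, or None when the property is not set."""
        return self._properties.get(name)


# A rank 1 dataset whose zeroth index is tagged by a second dataset.
time = SimpleDataSet([0.0, 1.0, 2.0], UNITS="s", NAME="time")
flux = SimpleDataSet([10.0, 12.5, 11.0], DEPEND_0=time, LABEL="Flux")
```

Here flux.property("DEPEND_0") returns the tags dataset, so the tag for flux.value(1) is flux.property("DEPEND_0").value(1).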

3. Schemes Introduced

Along with versioning the interface, it's useful to identify the QDataSet schemes encountered and supported.

  • rank 1 timeseries (time)
  • single table spectrogram (time,freq)
  • rank 3 qube (time,energy,pitch)
  • image (hpixels, vpixels, rgb )
  • lat-lon timeseries (lat, lon, time )
  • array of vectors (time,component)
  • array of complex numbers (time,component)
  • array of waveforms (time,offset). DEPEND_1 offsets are rank 2. 2009/08/21
  • engineering spectrogram using RENDER_TYPE=nnSpectrogram
  • array of spectrograms (table,time,freq) This is the general das2 TableDataSet.
  • array F[X,Y,Z] where Y and Z are time-varying with X. (As with CDF.)

4. DataSet Properties

The properties attached to each dataset allow the dataset to describe complicated things.

From https://autoplot.svn.sourceforge.net/svnroot/vxoware/autoplot/trunk/QDataSet/src/org/virbo/dataset/QDataSet.java

DEPEND_0. type QDataSet. This dataset is a dependent parameter of the independent parameter represented in this DataSet. The tags for the DataSet's 0th index are identified by this tags dataset.

DEPEND_1. type QDataSet. This dataset is a dependent parameter of the independent parameter represented in this DataSet. The tags for the DataSet's 1st index are identified by this tags dataset. When DEPEND_1 is rank 2, then its first dimension goes with DEPEND_0 and its second are the tags for the second dimension. (TODO: Is this a QUBE? Check).

DEPEND_2. type QDataSet. This dataset is a dependent parameter of the independent parameter represented in this DataSet. The tags for the DataSet's 2nd index are identified by this tags dataset. When DEPEND_2 is rank 2, then its first dimension goes with DEPEND_0 and its second contains the tags for the DataSet's 2nd index. (TODO: Is this a QUBE? Check).

DEPEND_3. type QDataSet. This dataset is a dependent parameter of the independent parameter represented in this DataSet. The tags for the DataSet's 3rd index are identified by this tags dataset. When DEPEND_3 is rank 2, then its first dimension goes with DEPEND_0 and its second contains the tags for the DataSet's 3rd index. (TODO: Is this a QUBE? Check).

JOIN_0. type String. This indicates that the 0th dimension is a group of similar datasets, implicitly appending them.

BUNDLE_1. type QDataSet. This dataset describes how the columns should be split up into separate parameters. This rank 2 dataset has a length that is equal to the number of bundled datasets. The values(i,*) are the qube dimensions of the dataset, except for the first dimension. When all the bundled datasets are rank 1, then length(*) will be equal to zero. property(*,UNITS) will yield the unit for each dataset. Bundle dimensions generally add one physical dimension for each bundled dataset. property(*,DEPEND_0) is special, because it will return a string rather than a QDataSet. This string should refer to one of the bundled datasets by its NAME property. (Any property that returns a QDataSet should return a string referring to another dataset in the bundle.) Also the dataset is necessarily a QUBE.

BUNDLE_0. type QDataSet. This dataset describes how the columns should be split up into separate parameters. See BUNDLE_1. Note slicing a dataset on the zeroth dimension will move BUNDLE_1 to BUNDLE_0. Properties defined in this dataset will be overwritten by the BUNDLE dataset's properties. For example, if the dataset has property( UNITS, 0 ) defined as "Hz" but the bundle has property( UNITS,0 ) as "Hertz" then "Hertz" is used.

START_INDEX. type Integer. Only found in a bundle descriptor (BUNDLE_0 or BUNDLE_1), this returns the integer index of the start of the current dataset. If this is null, then the index used to access the value may be used. (E.g. a bundle of Rank 1 datasets.)
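The way a bundle descriptor and START_INDEX locate the columns of one bundled dataset can be sketched as follows. This is only an illustration under assumptions: the descriptor is modeled as a list of dictionaries, the key "dims" stands in for values(i,*), and the helper names column_range and unbundle are hypothetical.

```python
# Hypothetical bundle descriptor: one entry per column of the bundled dataset.
# "dims" stands in for values(i,*), the qube dimensions after the first.
descriptor = [
    {"NAME": "time", "UNITS": "s", "dims": []},
    {"NAME": "B_GSM_X", "UNITS": "nT", "dims": [3], "START_INDEX": 1},
    {"NAME": "B_GSM_Y", "UNITS": "nT", "dims": [3], "START_INDEX": 1},
    {"NAME": "B_GSM_Z", "UNITS": "nT", "dims": [3], "START_INDEX": 1},
]


def column_range(descriptor, j):
    """Return (start, stop) columns of the dataset that column j belongs to."""
    entry = descriptor[j]
    start = entry.get("START_INDEX")
    if start is None:
        start = j                      # null START_INDEX: use the index itself
    count = 1
    for d in entry["dims"]:
        count *= d                     # product of the qube dimensions
    return start, start + count


def unbundle(records, descriptor, j):
    """Extract the dataset containing column j from a list of records."""
    start, stop = column_range(descriptor, j)
    if stop - start == 1:
        return [rec[start] for rec in records]       # rank 1 result
    return [rec[start:stop] for rec in records]      # rank 2 result


records = [[0.0, 1.0, 2.0, 3.0],
           [1.0, 4.0, 5.0, 6.0]]
times = unbundle(records, descriptor, 0)
b_gsm = unbundle(records, descriptor, 2)
```

Any of the three B_GSM columns leads back to the same rank 2 dataset, because they share one START_INDEX.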

DEPENDNAME_1. type String. Only found in a bundle descriptor, the name of the dataset to use for DEPEND_1 when unbundling.

ELEMENT_NAME. type String. The NAME of the dataset when unbundling the high rank dataset. If not set, then the common parts of the NAMES are used. ELEMENT_NAME=B_GSM, NAME[0]="UT TIME" NAME[1]="B_GSM_X"

ELEMENT_LABEL. type String. The LABEL to use when unbundling the high rank dataset. If not set, then the common parts of the LABELS are used. ELEMENT_LABEL="B, GSM", LABEL[0]="UT TIME" LABEL[1]="B, GSM X"

BINS_1. type String. A comma-delimited list of keywords that describe the boundary type for each column. For example, "min,max" or "c95min,mean,c95max". A bins dimension doesn't add a physical dimension. TODO: describe boundaries Autoplot uses.

BINS_0. type String. A comma-delimited list of keywords that describe the boundary type for each column. For example, "min,max" or "c95min,mean,c95max". A bins dimension doesn't add a physical dimension.

PLANE_0. type QDataSet. Correlated plane of data. An additional dependent DataSet that is correlated by the first index. Note "0" is just a count, and does not refer to the 0th index. All correlated datasets must be correlated by the first index. TODO: what about two rank 2 datasets?

CONTEXT_0. type QDataSet. A dataset that stores the position of a slice or range in a collapsed dimension. In "Flux(Energy) @ Time=2009-03-16T11:19 UT", the Time=... comes from a context property. Note "0" is just a count, and does not refer to the 0th index. A dataset can have any number of contexts: Temperature @ ( Time, Long, Lat ): 37 deg F @ ( 2009-03-16T11:19 UT, 91.5331 deg West, 41.6579 deg North ) Typically this will be a rank 0 dataset, but may also be a rank 1 dataset with a bins dimension.

UNITS. type org.das2.units.Units. The dataset units, found in org.das2.units.Units. This will change to a String, with codes that provide units conversion and "Datum" handling.

FORMAT. type String. Java/C format string for formatting the values. This should imply precision, and codes that serialize data can use this to correctly format the data. Note Java 5 supports field specs like %tY-%tj, and these may be used for time data, as long as only these field types are in the string.

FILL_VALUE. type Number, value to be considered fill (invalid) data. Note NaN is always considered fill and its use is encouraged.

VALID_MIN. type Number. Range bounding measurements to be considered valid. Lower and upper bounds are inclusive. FILL_VALUE should be used to make the lower bound or upper bound exclusive. Note DatumRange in das2 contains logic that is exclusive on the upper bound.

VALID_MAX. type Number. Range bounding measurements to be considered valid. Lower and upper bounds are inclusive. FILL_VALUE should be used to make the lower bound or upper bound exclusive. Note DatumRange in das2 contains logic that is exclusive on the upper bound.
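The combined effect of FILL_VALUE, VALID_MIN and VALID_MAX can be sketched as a single validity test. The function name is_valid and its keyword arguments are hypothetical; the rules (NaN always fill, inclusive bounds) follow the descriptions above.

```python
import math


def is_valid(value, fill_value=None, valid_min=None, valid_max=None):
    """Return True if value should be treated as a valid measurement."""
    if math.isnan(value):
        return False                   # NaN is always considered fill
    if fill_value is not None and value == fill_value:
        return False
    if valid_min is not None and value < valid_min:
        return False                   # lower bound is inclusive
    if valid_max is not None and value > valid_max:
        return False                   # upper bound is inclusive
    return True
```

Setting fill_value equal to valid_max, for example, makes the upper bound exclusive, as suggested above.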

TYPICAL_MIN. type Number. Range used to discover datasets. This should be a reasonable representation of the expected dynamic range of the dataset.

TYPICAL_MAX. type Number. Range used to discover datasets. This should be a reasonable representation of the expected dynamic range of the dataset.

SCALE_TYPE. String, "linear", "log", "mod24", "mod360", etc.

LABEL. String, Concise Human-consumable label suitable for a plot label. (10 chars)

TITLE. String, Human-consumable string suitable for a plot title. (100 chars).

DESCRIPTION. String. Human-consumable string, between one sentence and four paragraphs in length.

MONOTONIC. Boolean, Boolean.TRUE if dataset is monotonically increasing. Also, the data must not contain invalid values. Generally this will be used with tags datasets. Negative CADENCE implies monotonic decreasing.

WEIGHTS. QDataSet, dataset of same geometry that indicates the weights for each point. Often weights are computed in processing, and this is where they should be stored for other routines. When the weights plane is present, routines can safely ignore the FILL_VALUE, VALID_MIN, and VALID_MAX properties, and use non-zero weight to indicate valid data. Further, averages of averages will compute accurately.
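The claim that averages of averages compute accurately can be illustrated with a small sketch, assuming each partial average carries its total weight forward. The function weighted_average is a hypothetical helper, and weights are plain lists here rather than a WEIGHTS dataset.

```python
def weighted_average(values, weights):
    """Return (average, total_weight); zero weight marks invalid points."""
    total = sum(weights)
    if total == 0:
        return float("nan"), 0
    # Skip zero-weight points explicitly so that NaN fill cannot leak in.
    acc = sum(v * w for v, w in zip(values, weights) if w != 0)
    return acc / total, total


# Two partial averages, each keeping its total weight.
avg_a, w_a = weighted_average([1.0, 2.0, 3.0], [1, 1, 1])
avg_b, w_b = weighted_average([10.0, float("nan")], [1, 0])

# Averaging the averages with their weights matches averaging 1, 2, 3, 10.
combined, w = weighted_average([avg_a, avg_b], [w_a, w_b])
```

Without the carried weights, the mean of avg_a and avg_b would be 6.0 rather than the 4.0 obtained from the four valid points.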

CADENCE. Rank 0 QDataSet, the expected distance between successive measurements where it is valid to make inferences about the data. For example, interpolation is disallowed for points 1.5*CADENCE apart. This property only makes sense with a tags dataset.
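One way a client might apply CADENCE is to flag the intervals where interpolation should not be attempted. The sketch below reads "1.5*CADENCE apart" as "at least 1.5 times the cadence"; that reading, the function name find_gaps, and the limit parameter are assumptions.

```python
def find_gaps(tags, cadence, limit=1.5):
    """Return index pairs (i, i+1) whose spacing is at least limit*cadence."""
    threshold = limit * abs(cadence)   # abs() allows a negative cadence
    return [(i, i + 1) for i in range(len(tags) - 1)
            if abs(tags[i + 1] - tags[i]) >= threshold]


tags = [0.0, 1.0, 2.0, 5.0, 6.0]
gaps = find_gaps(tags, cadence=1.0)
```

In this example only the step from 2.0 to 5.0 is reported, so a renderer would break the line between those two points.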

DELTA_PLUS. QDataSet of rank 0, or a correlated plane, that limits accuracy. This should be interpreted as the one standard deviation confidence level. See also BINS_i for measurement intervals.

DELTA_MINUS. QDataSet of rank 0, or a correlated plane, that limits accuracy. This should be interpreted as the one standard deviation confidence level. See also BINS_i for measurement intervals.

CACHE_TAG. CacheTag, to be attached to tags datasets. This is an object that represents the coverage and resolution of the interval covered. For example, in Autoplot the TimeSeriesBrowse uses this to keep track of what's already been read.

RENDER_TYPE. String, hint as to preferred rendering method. Examples include "spectrogram", "time_series", and "stack_plot". In Autoplot, this may be the name of one of its

NAME. String, a Java identifier that can be used when an identifier is needed. This was originally introduced for debugging purposes, so datasets can have a concise, meaningful name that is decoupled from the label. When NAMEs are used, properties with the same name should only refer to the named dataset.

QUBE. Boolean.TRUE indicates that the dataset is a "qube," meaning that all dimensions have fixed length and certain optimizations and operators are allowed. Note that when DEPEND_1 is a rank 1 dataset, this implies QUBE. Likewise BUNDLE_1 is a qube.

COORDINATE_FRAME. String, representing the coordinate frame of the vector index. The units of a dataset should be EnumerationUnits which convert the data in this dimension to dimension labels that are understood in the coordinate frame label context. (E.g. X,Y,Z in GSM.) (Note this is before BUNDLE dimensions were formalized and is not used.)

METADATA. Map<String,Object> representing additional properties used by client codes. No interpretation is done of these properties, but they are passed around as much as possible. Object can be String, Integer, Double, or Map<String,Object>. METADATA_MODEL is a string identifying the type of metadata, a scheme for the metadata tree, such as ISTP-CDF or SPASE.

METADATA_MODEL. String identifying a scheme for the metadata tree, such as ISTP or SPASE. This should identify a node's type when the node is present, but should not require that the node be present. When a required node is missing, this should be treated as if none of the metadata is available. This logic is to support aggregating metadata. The constants VALUE_METADATA_MODEL_ISTP and VALUE_METADATA_MODEL_SPASE should be used to identify these.

VERSION. String, human-consumable version identifier. Presently this is intended for human consumption, but eventually we may make it usable by software as well. Note if multiple versions go into making a product (e.g. aggregation), the version string should contain space-delimited version ids, so versions must not contain spaces for other purposes. Also, two version strings containing the same value can be coalesced. If this is prefixed with "<scheme>:", then it is to be interpreted as follows:

  • sep: period-delimited list of numbers, numerically sorted: 2.2.0 < 2.15.2 < 10.2.0
  • alpha: alpha-numeric sorted: 20030202B > 20030202A
  • otherwise it should be numerically sorted.

(See org.das2.fsm.FileStorageModelNew.) Examples: "sep:2.15.2" "2.15" "alpha:20030202B"
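The sorting rules for a single version id can be sketched as a key function. This is an illustration, not the logic of org.das2.fsm.FileStorageModelNew; the name version_key is hypothetical, and keys are only meant to be compared between versions that use the same scheme.

```python
def version_key(version):
    """Return a sortable key for one version id (no spaces)."""
    scheme, colon, rest = version.partition(":")
    if not colon:
        return ("numeric", float(version))     # no scheme: numeric sort
    if scheme == "sep":
        # Period-delimited fields, each compared as a number.
        return ("sep", tuple(int(p) for p in rest.split(".")))
    if scheme == "alpha":
        return ("alpha", rest)                 # plain string comparison
    raise ValueError("unrecognized version scheme: " + scheme)


ordered = sorted(["sep:10.2.0", "sep:2.2.0", "sep:2.15.2"], key=version_key)
```

A plain string sort would place "sep:10.2.0" first; the key function orders the fields numerically instead.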

SOURCE. String, Human-consumable string identifying the source of a dataset, such as the file or URI from which it was read. Clearly this is easily lost as processes are applied to the data, but when no other source is involved in a process (excluding library code itself), then the source should be preserved.

USER_PROPERTIES. Map<String,Object> representing additional properties used by client codes. No interpretation is done of these properties, but they are passed around as much as possible.

5. Schemes

I thought it would be useful to enumerate colloquial terms for datasets. I use these phrases all over the place in the Autoplot documentation.

  • time series is a dataset with DEPEND_0 a set of timetags, or a JOIN dataset of time series datasets.
  • simple table is a rank 2 dataset, maybe with DEPEND_0 and DEPEND_1 set.
  • join dataset is a dataset that has JOIN_0 set to combine similar datasets into one dataset. "Array of" dimension.
  • set of simple tables is a rank 3 JOIN of simple tables datasets. These are used a lot at the Plasma Wave Group for representing data with instrument mode changes.
  • rank zero dataset is a single value and metadata. This corresponds to a das2 Datum.
  • bounds is a rank 1 dataset with a BINS dimension specifying min and max. Typically max would be exclusive and min would be inclusive. This corresponds to a das2 DatumRange.
  • bounding cube is a rank 2 dataset joining a set of bounds. Has JOIN_1.
  • bundle is a set of datasets with different units and possibly rank. This is rank 1 or rank 2, and BUNDLE_0 or BUNDLE_1 is a QDataSet containing the properties and rank of each bundled dataset.
  • simple bundle is a set of datasets all having the same rank and geometry.
  • complex bundle is a bundle of rank 1, rank 2, and rank N datasets.
  • bundle descriptor is the dataset that BUNDLE_1 points to, describing each of the bundled datasets.
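Several of these terms can be checked by looking only at rank and properties, in the duck-typing spirit. The sketch below models a dataset as a dictionary with "rank" and "properties" keys, which is an assumption made for this example, and it covers only the terms whose definitions above are mechanical.

```python
def schemes(ds):
    """Return the set of colloquial scheme names a dataset satisfies."""
    rank, props = ds["rank"], ds["properties"]
    found = set()
    if rank == 0:
        found.add("rank zero dataset")
    if rank == 2:
        found.add("simple table")
    if "JOIN_0" in props:
        found.add("join dataset")
    if rank == 1 and props.get("BINS_0") == "min,max":
        found.add("bounds")
    if rank in (1, 2) and ("BUNDLE_0" in props or "BUNDLE_1" in props):
        found.add("bundle")
    return found


# A rank 2 dataset carrying both DEPEND_1 and BUNDLE_1 satisfies two schemes.
table_and_bundle = {"rank": 2,
                    "properties": {"DEPEND_1": object(), "BUNDLE_1": object()}}
time_range = {"rank": 1, "properties": {"BINS_0": "min,max"}}
```

Returning a set rather than a single label lets one dataset be, for instance, both a simple table and a bundle.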

Note more abstract schemes are coming to describe scientific use of datasets.

6. Modification History

6.1. 2009-May (v.1.05)

6.1.1. Rank 0 Support introduced

rank zero added to QDataSet, removing need for RankZeroDataSet interface.

6.2. 2009-July (v1.10)

6.2.1. Rank 4 Support introduced

QDataSet interface extended to support rank 4 datasets.

6.2.2. Bundle dimension added

BUNDLE_1. An index can refer to a bundle dimension that bundles datasets together. Previously, DEPEND_1 was used, and it would point to a dataset of dataset labels. Bundles may deprecate planes. Bundle dimension must be the 1st dimension, may not be the zeroth dimension.

6.2.3. Bundle DataSets with DEPEND_0 in property(DEPEND_0,idx)

Slice0 dataset has code that checks for property(DEPEND_0,idx) to see if the DEPEND_0 dataset needs to be moved to the zeroth dimension.

6.3. 2009-Aug (v1.20)

6.3.1. Rank 2 DEPEND_1

DEPEND_1 can be a rank 2 dataset, in which case the dataset is a non-qube. slice0 should slice DEPEND_1 as well. slice1 is an invalid operation. There is code that checks for rank 2 DEPEND_0, and it is illegal for both DEPEND_0 and DEPEND_1 to have rank>1.

6.3.2. CONTEXT_0 property

Slice operations that remove a dimension can identify the slice position using the CONTEXT_<idx> property which is equal to a rank 0 dataset.

  • In "Flux(Energy) @ Time=2009-03-16T11:19 UT", the Time=... comes from a context property.
  • A dataset can have any number of contexts, and "%{CONTEXT}" may refer to a string representing all of them.
    • Temperature @ ( Time, Long, Lat ): 37 °F @ ( 2009-03-16T11:19 UT, 91.5331° West, 41.6579° North )

Conventionally, the contexts should reflect the original nesting of the contexts. If flux(time,energy,pitch) is sliced on energy then time, the context should still be @ ( time, energy ), regardless of the slice order. This will be difficult to implement, so this is not required.
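A slice on the zeroth index, including the rank 2 DEPEND_1 case and the context bookkeeping, might look like the following sketch. Datasets are modeled as dictionaries with "values" and "properties" keys, which is an assumption for this example; new contexts are simply appended at the next free count, so the original nesting order is not reconstructed.

```python
def slice0(ds, i):
    """Slice a rank N dataset at index i of its zeroth dimension."""
    props = ds["properties"]
    # Carry over existing contexts and a few simple properties.
    out = {k: v for k, v in props.items()
           if k.startswith("CONTEXT_") or k in ("UNITS", "LABEL", "NAME")}
    dep0 = props.get("DEPEND_0")
    if dep0 is not None:
        n = sum(1 for k in out if k.startswith("CONTEXT_"))
        # The slice position becomes a rank 0 context dataset.
        out["CONTEXT_%d" % n] = {"values": dep0["values"][i],
                                 "properties": dict(dep0["properties"])}
    dep1 = props.get("DEPEND_1")
    if dep1 is not None:
        tags = dep1["values"]
        if tags and isinstance(tags[0], list):
            # Rank 2 DEPEND_1 varies with DEPEND_0, so slice it as well.
            dep1 = {"values": tags[i], "properties": dict(dep1["properties"])}
        out["DEPEND_0"] = dep1         # the old 1st index is now the 0th
    return {"values": ds["values"][i], "properties": out}


time = {"values": ["2009-03-16T11:19", "2009-03-16T11:20"],
        "properties": {"NAME": "Time"}}
energy = {"values": [10.0, 100.0, 1000.0], "properties": {"UNITS": "eV"}}
flux = {"values": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
        "properties": {"DEPEND_0": time, "DEPEND_1": energy, "LABEL": "Flux"}}

spectrum = slice0(flux, 0)
```

The resulting spectrum is Flux(Energy), and its CONTEXT_0 holds the time of the slice.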

6.4. 2009-Sep-28 (v1.30)

6.4.1. Metadata Handling

  • Introduce METADATA property (Map<String,Object>) and METADATA_MODEL property.
  • These will be treated similar to the USER_PROPERTIES property.
  • Aggregation logic to be determined:
    • propagate nodes with equal value, or
    • allow nodes to have multiple values by appending a unique number (RANGE-1, RANGE-2, etc.), and metadata models should handle aggregation.

6.4.2. Schemes Used

  • Rank 1 "Bins" dataset with SCALE_TYPE property used to indicate autorange result. This is formatted with the DataSetUtil.format.

6.5. 2009-Nov-12

  • With Jon Vandegriff, we agreed that scheme identifiers should be like:
    • duck typing. If it has the properties of an X it is an X. Scheme implementations must be able to identify.
    • inheritance used and should be indicated: Series>GageHeight. I don't have to know what a GageHeight is to know that it is a Series.
    • multiple inheritance used: timetagged,spectrogram means it has the properties of a "timetagged" and "spectrogram"

6.6. 2010-Jun-24

  • properties for dimensions (DEPEND_0, BUNDLE_0, BINS_0, JOIN_0, etc) have their scope limited to that dimension. e.g. property(TITLE,0) would often be the same as property(TITLE), and this was true for dimensional properties as well. Some datasets have property(DEPEND_0), but should not have property(DEPEND_0,0), which is the same as property(DEPEND_1).
  • property names like "LABEL[0,2]" are allowed only if they are the same as property("LABEL",0,2).

6.7. 2010-Sep-23

  • rank 2 DEPEND_2 and DEPEND_3 imply dependence varying with DEPEND_0. Slice0 operator must slice these as well when DEPEND_2 and DEPEND_3 rank==2.

6.8. 2011-Feb-14

  • introduce VERSION and SOURCE. In Autoplot, VERSION can be reported in Renderer label with %{VERSION}
  • Indexed properties can be aliased with <name>__<i0>_<i1>. In Autoplot this is implemented in AbstractDataSource, which stores all properties in one hashmap. Slice0 needs to go through and move <name>__<i0>_<i1> to <name>__<i1>.

6.9. 2011-Feb-28

  • indexed property change shows other problems, and now we remove high-rank indexed properties completely. Now property(NAME) and property(NAME,Index) are the only property accessors (autoplot 2011).
  • JOIN_0 property is now a QDataSet, not a string. This contains an empty dataset which may have the property QDataSet.CACHE_TAG set.
  • "strict" JOIN datasets have the property that all the joined datasets have the same UNITS. Bounding box is a JOIN of BINS datasets, but is not strict. If the JOIN_0 property is set, then the dataset is a strict JOIN. If no property is set, then it is an implicit join.

6.10. 2011-Apr-13

  • Bundles now have same length as the dataset they describe. Before, bundling rank 2 datasets would mean datasets would have different lengths. This resulted in a bunch of broken code that assumed a bundle of Rank 1 datasets, even though the spec always had this. The new property START_INDEX points to the first index of the high-rank dataset, and values and properties are repeated product(qubedims) times. This also allows slice1 to be used to unbundle a rank 1 dataset, even if it is a single component of a rank 2 dataset.
  • DEPENDNAME_1 added so that DEPEND_1 is always a QDataSet. Before DEPEND_1 could be a string if it was in a BUNDLE.
  • ELEMENT_NAME and ELEMENT_LABEL added, so that individual elements can have different NAME and LABEL properties, and ELEMENT_NAME contains the name of the unbundled dataset. If not found, then the NAME for the unbundled dataset should be the common parts of the element names. (B_GSM_X, B_GSM_Y, B_GSM_Z --> B_GSM).
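The fallback for a missing ELEMENT_NAME can be sketched by taking the shared leading characters of the element names. Reading "common parts" as a common prefix is an assumption, as is the helper name unbundled_name and the choice to trim trailing separators.

```python
import os


def unbundled_name(element_names, element_name=None):
    """Return the NAME to use for the unbundled high-rank dataset."""
    if element_name:
        return element_name            # ELEMENT_NAME wins when it is set
    # commonprefix compares character by character, not path components.
    prefix = os.path.commonprefix(element_names)
    return prefix.rstrip("_ ,")        # drop trailing separators


name = unbundled_name(["B_GSM_X", "B_GSM_Y", "B_GSM_Z"])
```

With the names above the shared prefix is "B_GSM_", which is trimmed to "B_GSM".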

6.11. 2011-Jun-14

  • Consider duck-typing to allow a dataset to be both a simple table (with DEPEND_1) and a bundle (with BUNDLE_1). LANL wants to be able to go back and forth. There's lots of code that assumes that if it's a bundle then it is not a simple table, but the intention all along has been more like duck typing.

6.12. 2011-Nov-18

6.13. 2012-Feb-14

  • Declare DESCRIPTION as part of the properties. This is not supported anywhere, but we've talked about this for a while and we ought to at least pick a name.