developer.dataset.traps
From autoplot.org
Purpose: outline data set traps.
Contents |
1. Introduction
QDataSet is the result of looking at the limitations of data models attempted over the years, and trying to create a model that was flexible enough to work in many situations. With this flexibility it is able to adapt, but people have a tough time understanding it. Further, it has the problem that it's so unconstrained that the hundreds of methods that use it might as well be passing Objects around, you don't know which scheme of any dataset from the interface.
2. X,Y,Z,?
When using X, Y, and Z in your model, you will be trapped into locking your data model to the display, and you won't be able to model high dimensionality (and also single-dimensionality). Suppose you have Flux sampled at Time, Energy, and Pitch Angle. Maybe Time is clearly "X" and Flux is clearly "Z", but you can see why this breaks down quickly.
3. Scalars
When using X, Y, and Z in your model, you will be trapped into locking your data model to the display, and it's not clear how you would describe scalars (single-dimensionality). Suppose you wish to model scalars, indicating an event time. Would the interface just have a getX() method? Suppose you want to indicate an Energy, so should this be getY()?
4. Names for Dimensionality
Tables have two indexes. Qubes have N indexes. In QDataSet, the rank function returns the number of indices, and you can say rank 2 dataset instead of Table. What should the thing with one index be called? In the old Das2 model we called these "Vectors" which was a bad name. Here are suggested names:
X | SDI | QDataSet |
---|---|---|
Number | Rank 0 DataSet | |
Array | XYData,Data1D | Rank 1 DataSet |
Table | Data2D | Rank 2 DataSet |
ThreeQube (?) | Rank 3 DataSet | |
Rank 4 DataSet |
And words to avoid, because they have other definitions that cause confusion: Scalar, Vector, Matrix
Note in QDataSet, the rank is quite different than the dimensionality. Both a simple spectrogram and a table of N columns are both rank 2, but the spectrogram occupies three dimensions, while the table occupies N dimensions. (The BUNDLE_1 property is used to differentiate.)
5. Zeroth vs First
This still bugs me, where humans want to count things with natural numbers, but indexes should clearly start at 0. TODO: there must be guidelines for this, and this should be considered.
6. Cadence vs Duration
Cadence is the spacing between measurements, which is different than the duration of a measurement.
7. Ratiometric spacing
Ratiometric spacings come up a lot in our field, where you have channels whose centers increase logarithmically. "decibels" and "percentIncrease" are good units for describing the cadence with a single number.
8. Times can be modeled as double since Datum
Times can be modeled effectively with a double and a unit that contains a Datum, such as "seconds since 2010-01-01T00:00."
9. Tags should identify centers, Bins identify ranges, but the best is to identify all three
- When just one number is used to identify a location with an aliasing interval, it must be the center of the interval.
- If an aliasing interval bin must be described, then have a bin with a start and end so there is nothing implicit.
- The best case would be to have all three numbers.