developer.URI syntax
From autoplot.org
Autoplot URIs have caused some controversy. There's not a clear distinction between what part represents the file resource and what represents the data source configuration. In fact, it's the data source that makes this judgement, so clearly some file resources will not be accessible when there are parameter name conflicts. This document tries to formalize the URIs.
The fundamental goal of Autoplot is to allow access to existing data resources. The dataset URI must:
- identify the existing resource
- identify the DataSource that can interpret the existing resource
- minimally configure the DataSource to a point where the existing resource is useful, but also to where a physical parameter is extracted.
They also serve to identify datasets, so they must:
- be complete, so they may be easily conveyed, such as in an email.
- be readable by humans
- be hackable by humans
- be formed by humans with assistance (e.g. documentation, completion, or GUI editor)
See also http://tsds.org/uri_templates, which is used for aggregation; a copy of it is kept at uri_templates.
1. URIs, not URLs
Calling them URIs fixes a lot of problems. To call them URLs implies that a web browser should be able to handle them reasonably. Colloquial URIs that look like URLs will be promoted into Autoplot URIs.
2. Schema
All URIs are implicitly "vap", so
/home/jbf/mydata.qds
is translated to the explicit URI
vap+qds:file:///home/jbf/mydata.qds.
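The promotion of a colloquial name to an explicit URI can be sketched as follows. This is only an illustration: the function name to_explicit_uri is hypothetical, and the sketch assumes that the file extension is used as the data source ID.

```python
import posixpath


def to_explicit_uri(colloquial):
    """Promote a colloquial path or URL to an explicit vap URI (sketch)."""
    # Already explicit: leave it alone.
    if colloquial.startswith(("vap:", "vap+")):
        return colloquial
    # A bare local path becomes a file URL.
    resource = colloquial if "://" in colloquial else "file://" + colloquial
    # Assumption: the extension of the file part names the data source.
    path = resource.split("?", 1)[0]
    ext = posixpath.splitext(path)[1].lstrip(".")
    prefix = "vap+" + ext if ext else "vap"
    return prefix + ":" + resource


print(to_explicit_uri("/home/jbf/mydata.qds"))
```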
3. Syntax
vap[+<dataSourceID>:](http|https|ftp|file)://[<user>[:<password>]@]<auth>/<filepath>/<filename>[.<ext>][;<serverQueryParams>][?<dataSourceParams>][|<filter>]*
\--- data source ---/\----------------------------------------file resource-------------------------------------------------/\---- data source --/\--filters-/
\--- data source ---/\----------------------------------------resource URI--------------------------------------------------/\---- data source --/\--filters-/

alternate being considered that uses hash (#) to delimit the server-side and client-side parts (no hash implies all client-side):

vap[+<dataSourceID>:](http|https|ftp|file)://[<user>[:<password>]@]<auth>/<filepath>/<filename>[.<ext>][&<serverQueryParams>][#<dataSourceParams>][|<filter>]*
\--- data source ---/\----------------------------------------file resource-------------------------------------------------/\---- data source --/\--filters-/

vap[+<dataSourceID>:](http|https|ftp|file)://[<user>[:<password>]@]<auth>/<filepath>/<filename>[.<ext>][&<dataSourceParams>][!<filter>]*
Note when the file resource protocol is http or https, a semicolon is used to delimit the server query params, and the last semicolon should be replaced with a question mark "?" before forming the URL. This is not implemented, and the file cache mechanism will need to be modified.
The URIs may not contain a file resource at all. This shows that ultimately the DataSource must interpret the URI:
vap+jdbc:mysql://192.168.0.203:3306/temperatures.sql?table=temperatures&depend0=time&temperature
Sometimes the URI contains parameters that are used by both the server and the DataSource. Here, both the OpenDAP server and the DataSource use the parameter "Proton_Density":
http://cdaweb.gsfc.nasa.gov/cgi-bin/opendap/nph-dods/istp_public/data/genesis/3dl2_gim/2003/genesis_3dl2_gim_20030501_v01.cdf.dds?Proton_Density
It is the data source that knows to pass on the parameter list to the server.
3.1. data source params
We then have a conventional structure for the query part of the URI:
?Density(10000:20000:1000)&param1=value1&param2=value2
data_source_params = [ ds_name_subset ] [ "&" param_name "=" param_value | switch_name ] *
ds_name_subset = identifier [ "(" start_index ":" [ end_index ] [ ":" stride ] ")" ]
param_name = identifier
param_value = param_value_gen | param_value_typed | param_value_list
( to allow for parameter specification, we call out typed parameters )
param_value_gen = ( alpha | num | "." | ":" | "+" ) *
param_value_typed = string | int | float | sci
param_value_list = param_value_typed [ "," param_value_typed ] *
string = param_value_gen
int = [ "+" | "-" ] digit+
float = real numbers, avoiding use of scientific notation
sci = real numbers, using scientific notation
identifier = alpha [ alpha | num | "_" ] *
start_index = int
end_index = int
stride = digit+ (python supports negative stride, but no data sources do.)
Note existing code uses brackets "[" "]" in subset, but this is discouraged in the URI spec http://www.ietf.org/rfc/rfc2396.txt
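As a rough illustration of the grammar above, the following sketch splits a data source params string into the named parameter, its optional subset, the name=value pairs, and the switches. The function name and the shape of the returned dictionary are assumptions made for this example, and only the parenthesis form of the subset is handled.

```python
import re

# identifier, optionally followed by "(start:[end][:stride])"
_NAME_SUBSET = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z0-9_]*)"
    r"(?:\((?P<start>[+-]?\d+):(?P<end>[+-]?\d+)?(?::(?P<stride>\d+))?\))?$"
)


def parse_data_source_params(query):
    """Parse the text after '?' following the grammar above (sketch)."""
    result = {"name": None, "subset": None, "params": {}, "switches": []}
    for position, item in enumerate(query.split("&")):
        if not item:
            continue
        if "=" in item:
            key, value = item.split("=", 1)
            result["params"][key] = value
            continue
        # Per the grammar, only the first item may be the name with subset.
        match = _NAME_SUBSET.match(item) if position == 0 else None
        if match:
            result["name"] = match.group("name")
            if match.group("start") is not None:
                end = match.group("end")
                stride = match.group("stride")
                result["subset"] = (
                    int(match.group("start")),
                    int(end) if end is not None else None,
                    int(stride) if stride is not None else 1,
                )
        else:
            result["switches"].append(item)
    return result


print(parse_data_source_params("Density(10000:20000:1000)&param1=value1&param2=value2"))
```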
4. Conventions
4.1. Subsetting
<parameter>[start:stop:stride] is used to specify dataset subsetting. Subsetting is handled by the data source. start is the first index, and is zero if not specified. stop is the non-inclusive last index, and is the length if not specified. Negative indices specify an index with respect to the last element. This may not be uniformly supported, as it requires that the number of elements be known. stride is the index increment, and is 1 if not specified. See Python conventions for array indexing.
5:        all but the first five elements.
5:20      15 elements, the first element is 5.
0:100:5   elements 0,5,10,...,and 95.
:         all elements
::5       every fifth element
-5:       the last five elements.
Note some data sources don't follow this convention, such as the OpenDAP data source. In this case, the existing convention for OpenDAP is used. Data Sources must clearly document the convention used.
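Since the convention follows Python array indexing, the examples above can be checked directly with Python slices. The helper name parse_subset is hypothetical; it is a minimal sketch of how a data source might turn the text between the brackets into an index range.

```python
def parse_subset(spec):
    """Turn 'start:stop:stride' text into a Python slice (sketch)."""
    parts = spec.split(":")
    if not 2 <= len(parts) <= 3:
        raise ValueError("expected start:stop or start:stop:stride, got %r" % spec)
    parts += [""] * (3 - len(parts))
    # Empty fields fall back to the defaults described above.
    start, stop, stride = (int(p) if p else None for p in parts)
    return slice(start, stop, stride)


data = list(range(100))
print(data[parse_subset("-5:")])
print(len(data[parse_subset("5:20")]))
```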
4.2. Physical Parameter Name
An unnamed parameter in the dataSourceParams part should identify the primary parameter to plot.
vap:file:///data/params.cdf?param1
When a parameter name is used, the dataset returned ought to be named using the parameter name. The parameter names should be C-style identifiers, and when the existing resource contains a name that does not conform, a mapping should be used. For example, an Excel worksheet contains a column "Phase Angle." This column may be identified as "Phase_Angle" by the DataSource. The name should be formed by converting spaces, slashes, and pluses to underscores; prefixing ids that start with a number with an underscore; and removing other characters.
Phase Angle -> Phase_Angle
1 -> _1
L-Shell -> LShell
V, km/s -> V_km_s
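The mapping rules above can be sketched in a few lines. The function name to_identifier is hypothetical, and the sketch does not cover labels containing parentheses.

```python
import re


def to_identifier(label):
    """Map a label to a C-style identifier using the rules above (sketch)."""
    # Spaces, slashes, and pluses become underscores.
    name = re.sub(r"[ /+]", "_", label)
    # All other non-identifier characters are removed.
    name = re.sub(r"[^A-Za-z0-9_]", "", name)
    # Names starting with a digit get an underscore prefix.
    if name and name[0].isdigit():
        name = "_" + name
    return name


for label in ["Phase Angle", "1", "L-Shell", "V, km/s"]:
    print(label, "->", to_identifier(label))
```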
When a label contains parentheses, these should be interpreted as providing a context for the data. When the context cannot be identified, the parenthesis contents should be mapped in the same way as the label name, and the first parenthesis is replaced with an underscore.
Bx(gsm) -> Bx_gsm
Speed(km/s) -> Speed UNITS=km_s
Note that droplist should use the labels when presenting options, but then use these rules to create valid names (identifiers).
4.3. reserved Parameter names
Here is a set of parameter names to promote consistency across data sources. This should not imply that all data sources accept these parameters. However if such functionality is to be supported, then this is how the parameter should be specified.
4.3.1. timerange=<datumRange>
Time range subset.
4.3.2. resolution=<datum>
Resolution parameter that implies that averaging is to be done to reduce data.
4.3.3. cadence=<datum>
Subsetting parameter that implies that subsampling is to be done to reduce data.
4.3.4. depend0=<name>
name of the physical parameter to use as the independent parameter for the primary physical parameter.
4.3.5. rank2[=start,stop]
create the physical parameter by trimming a rank 2 table. For example, an ascii table or excel worksheet is inherently rank 2: the first dimension is the records and the second dimension is the fields. Typically the parameter is extracted by slicing the second dimension to get a column. Instead, trim the rank 2 table to create a new rank 2 table that is the result.
4.3.6. skip<RecordType>=<count>
skip this number of records in the medium. (e.g. Lines in an ascii file.)
4.3.7. first<RecordType>=<int>
use this record in the medium as the first record. (Row number in an excel spreadsheet.)
4.3.8. recCount=<int>
This parameter is redundant since the subsetting notation provides the same functionality, but it's often easier to specify the limit rather than form the end from the start:
data[25:150]
data[50:175]
vs
data[25:]&recCount=125
data[50:]&recCount=125
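The equivalence between the two forms is simply end = start + recCount. A small sketch, with the hypothetical helper name apply_rec_count, that rewrites an open-ended subset into the explicit form:

```python
import re


def apply_rec_count(subset, rec_count):
    """Rewrite 'name[start:]' plus recCount as 'name[start:end]' (sketch)."""
    match = re.match(r"^(\w+)\[(\d+):\]$", subset)
    if not match:
        raise ValueError("expected name[start:], got %r" % subset)
    name, start = match.group(1), int(match.group(2))
    return "%s[%d:%d]" % (name, start, start + rec_count)


print(apply_rec_count("data[25:]", 125))
print(apply_rec_count("data[50:]", 125))
```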
4.3.9. where(condition)
subset with condition. Condition contains gt,lt,=,',and,or,etc to constrain search. (e.g. constrain in sql query.)
vap:sql:jdbc:mysql://192.168.0.203:3306/temperatures.sql?table=temperatures&temperature&depend0=time&where(temperature.lt(32.))
The parameter names in the condition must match the dataset names.
Note this is analogous to the trim operation when the condition is a MONOTONIC DEPEND dataset.
Consider time.within('2009-01-01') as an alternative to time.ge('2009-01-01T00:00').and(time.lt('2009-01-02T00:00')). LANL team has asked for energy ranges (&energy.within('30to200')). (not implemented)
4.3.10. near(time,2009-10-29T15:02)
slice at the index closest to the condition. The first parameter identifies a dataset and should be a DEPEND of the dataset. The UNIT property of this dataset is used to parse the second parameter value, and the index of the closest, relevant value is used.
(not implemented)
4.3.11. validMin,validMax
4.3.12. fillValue
4.3.13. arg_0,arg_1,arg_2,...
arg_<N> refers to the Nth positional parameter, and shouldn't be used for a parameter name.
4.4. URI Encoding
+ should be used to encode spaces, rather than %20. %2B should be used to encode +. Spaces occur in these URIs more often, and this improves readability.
4.4.1. colloquial URIs
- these are Strings that can be converted into URIs.
- spaces in file names are converted into %20.
- spaces in parameter lists are converted into pluses.
- pluses in parameter lists are converted into %2B.
- note that if there are pluses but the URI is valid, then pluses may be left alone.
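The conversion rules above can be sketched as follows. The function name encode_colloquial is hypothetical, and the sketch assumes that a parameter list with no literal spaces is already valid, so its pluses are left alone.

```python
def encode_colloquial(uri):
    """Convert a colloquial URI string into a valid URI (sketch)."""
    resource, sep, params = uri.partition("?")
    # Spaces in file names are converted into %20.
    resource = resource.replace(" ", "%20")
    # Assumption: a literal space means the pluses are literal pluses too.
    if " " in params:
        params = params.replace("+", "%2B").replace(" ", "+")
    return resource + sep + params


print(encode_colloquial("vap:file:///home/jbf/my data.cdf?He++ Density"))
```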
4.4.1.1. Test URIs
- vap:file:///home/jbf/Downloads/CRRES_LOID_1991172000000_V1.cdf?3_He++%20Density
- vap:file:///home/jbf/Downloads/CRRES_LOID_1991172000000_V1.cdf?3_He++ Density
- vap:file:///home/jbf/Downloads/CRRES_LOID_1991172000000_V1.cdf?3_He%2B%2B%20Density
4.4.2. Resources
- https://sourceforge.net/tracker/?func=detail&aid=3049295&group_id=199733&atid=970682
- https://sourceforge.net/tracker/index.php?func=detail&aid=3004112&group_id=199733&atid=970682
- https://sourceforge.net/tracker/index.php?func=detail&aid=3038349&group_id=199733&atid=970682
- https://sourceforge.net/tracker/index.php?func=detail&aid=3049295&group_id=199733&atid=970682
- https://sourceforge.net/tracker/index.php?func=detail&aid=3056851&group_id=199733&atid=970682
- https://sourceforge.net/tracker/index.php?func=detail&aid=2903189&group_id=199733&atid=970682
- http://trprod.proofstat.com/homepages/webcatalog/cec/specs/special_c.htm
5. Template specs
Currently percents are used to identify template fields, such as timeFormat=%Y+%j+%H in ascii table parser. When these are percent-encoded, they are difficult to read. Therefore, the convention should be to use dollar sign ($) instead of percent, which is valid in a URI:
timeFormat=$Y+$j+$H
Also, a pair of parentheses following the dollar sign (e.g. $(H;cadence=6)) should be used to further add template configuration.
/home/jbf/data/data$Y$J$(H;cadence=6).dat
/home/jbf/data/cluster$(scid;enum=1,2,3,4).dat
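To make the two field forms concrete, here is a minimal sketch that extracts the fields from a template. The function name is hypothetical, and the sketch assumes that a bare field is a single letter and that options inside the parentheses are separated by semicolons; nested parentheses, as in a when(...) clause, are not handled.

```python
import re

# Either $(name;key=value;...) or a single-letter field such as $Y.
_FIELD = re.compile(r"\$(?:\((?P<spec>[^)]*)\)|(?P<code>[A-Za-z]))")


def parse_template_fields(template):
    """Return a list of (field_name, options) found in a template (sketch)."""
    fields = []
    for match in _FIELD.finditer(template):
        if match.group("code"):
            fields.append((match.group("code"), {}))
        else:
            name, *options = match.group("spec").split(";")
            params = dict(opt.split("=", 1) for opt in options if "=" in opt)
            fields.append((name, params))
    return fields


print(parse_template_fields("/home/jbf/data/data$Y$J$(H;cadence=6).dat"))
print(parse_template_fields("/home/jbf/data/cluster$(scid;enum=1,2,3,4).dat"))
```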
5.1. Automatic GUI creation
It's important to establish norms for URI templates to allow for future features such as automatic GUI creation. Autoplot 2012b has the first implementation of automatic GUI creation, based on its completions. Another feature that could come out of this is automated testing, where a script could be exercised through a generated list of known valid inputs.
5.2. Consider Regular Expressions with named groups
Java's regular expressions did not originally support named groups, but Java 7 added them with the (?<name>...) syntax. According to http://www.petefreitag.com/item/281.cfm, Python, PHP and .NET also support named groups, so there should be good conventions to follow.
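As an illustration of the idea, a template with simple time fields can be translated into a regular expression with named groups. This is a sketch only: the function name is hypothetical, the digit widths in the table are assumptions rather than part of this spec, and a field that appears twice in a template would need extra handling because group names must be unique.

```python
import re

# Assumed fixed digit widths for a few time fields (not from the spec).
_WIDTHS = {"Y": 4, "m": 2, "d": 2, "j": 3, "H": 2, "M": 2, "S": 2}


def template_to_regex(template):
    """Build a regex with one named group per $X field (sketch)."""
    pattern = ""
    pos = 0
    for match in re.finditer(r"\$([A-Za-z])", template):
        # Literal text between fields is matched exactly.
        pattern += re.escape(template[pos:match.start()])
        code = match.group(1)
        pattern += "(?P<%s>\\d{%d})" % (code, _WIDTHS[code])
        pos = match.end()
    pattern += re.escape(template[pos:])
    return re.compile("^" + pattern + "$")


rx = template_to_regex("data_$Y_$j_$H.dat")
print(rx.match("data_2009_302_15.dat").groupdict())
```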
5.3. examples of templates
$(type=int;validmin=0;validmax=4)
$(version;format='%4.2f')
$(type=string,name=scid;enum=c1,c2,c3,c4,format='%2s',regex='c[1-4]')
/home/jbf/data/data.bin?encoding=$(enum=pcm,alaw)&scale=(float;name=scale;when(enum.eq(pcm)))
5.4. reserved names for templates
time fields (see unix date command): Y,y,m,d,j,H,M,S,ms,us,ps,milliseconds,microseconds,N,nanoseconds,picoseconds
month names: b,B
am/pm: p,P
versioning: v, V, version (note deviation from date format.)
time zone: z, Z
locale-specific time: x, X (note X is skip/ignore elsewhere. Suggest ${date} for locale specific-time, and discourage use of this field.)
ignore / skip: ignore file_$Y_$(ignore).dat (note X is used elsewhere: file_$Y_$4X.dat)
5.4.1. reserved parameter names
cadence
validmin, validmax
enum
type (integer or real)
when: reserved to indicate when optional fields are required.
label: used to apply a label to a data source that doesn't support labels
units: used to apply units to a data source that doesn't support units
6. Use Cases
6.1. URI Completion
We'd like to allow scheme-aware URI completion. For example, triggering completion on "vap+" would list the plug-in extensions Autoplot knows about. Using "<C>" to indicate where the trigger is invoked, here are other examples:
vap+<C> shows a list of known datasource identifiers (extensions).
vap:jdbc:<C> shows a list of known jdbc plugins.
vap:jdbc:mysql://192.168.0.203:3306/<C> shows a list of databases.
/<C> shows a list of local files in the root directory.
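The first case, completion on "vap+", might look like the sketch below. The list of identifiers and the function name are hypothetical stand-ins for whatever registry of plug-ins is actually available.

```python
# Hypothetical registry of data source identifiers, for illustration only.
KNOWN_DATA_SOURCES = ["bin", "cdf", "dat", "jdbc", "nc", "qds"]


def complete_data_source(prefix):
    """List completions for a partial 'vap+<id>' prefix (sketch)."""
    if not prefix.startswith("vap+"):
        return []
    partial = prefix[len("vap+"):]
    return ["vap+" + name + ":" for name in KNOWN_DATA_SOURCES
            if name.startswith(partial)]


print(complete_data_source("vap+"))
print(complete_data_source("vap+c"))
```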
7. Discussion
We wish to uniquely identify datasets with URIs.
The URL is resolved to a DataSource, and the DataSource is then used to produce the dataset. Presently the URL's extension is used to identify the DataSource type, or when a mime type is available, the mime type may be used. Example DataSource types include AsciiTableDataSource (.dat), NetCDFDataSource (.nc), and CdfDataSource (.cdf).
DataSources may advertise additional capabilities, for example the ability to provide views of a long time series. Clients using this DataSource may use this interface to get more data from the DataSource.
7.1. URL spec
See http://gbiv.com/protocols/uri/rfc/rfc3986.html.
7.2. scheme
In general, the scheme and hier-part work as they would in any web browser. They specify a resource that can be accessed and interpreted once the format is known.
Scheme may be file, http, or ftp. In the future https and sftp may be supported as well.
LEGACY, WILL BE DROPPED: When a period is found in the scheme, the prefix before the period is used to specify the data source type, as with an extension. For example,
bin.http://www.autoplot.org/data/autoplot.cdf
uses the BinaryDataSource (.bin) to load the resource rather than CdfDataSource (.cdf).
When the URI is qualified with "vap+<ext>:", the data source type is explicitly specified. For example,
vap+bin:http://www.autoplot.org/data/autoplot.cdf
uses the BinaryDataSource (.bin) to load the resource rather than CdfDataSource (.cdf).
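The resolution order described here, an explicit "vap+<ext>:" qualifier first and the file extension otherwise, can be sketched as follows. The table and function name are hypothetical; only the four DataSource names mentioned in this section are included, and mime types and the legacy period form are not handled.

```python
import posixpath
import re

# Extension to DataSource type, using only the names mentioned in the text.
DATA_SOURCES = {
    "dat": "AsciiTableDataSource",
    "nc": "NetCDFDataSource",
    "cdf": "CdfDataSource",
    "bin": "BinaryDataSource",
}


def resolve_data_source(uri):
    """Pick the DataSource type for a URI, or None if unknown (sketch)."""
    explicit = re.match(r"^vap\+([A-Za-z0-9]+):", uri)
    if explicit:
        # An explicit qualifier overrides the file extension.
        ext = explicit.group(1)
    else:
        if uri.startswith("vap:"):
            uri = uri[len("vap:"):]
        path = uri.split("?", 1)[0]
        ext = posixpath.splitext(path)[1].lstrip(".")
    return DATA_SOURCES.get(ext)


print(resolve_data_source("vap+bin:http://www.autoplot.org/data/autoplot.cdf"))
print(resolve_data_source("http://www.autoplot.org/data/autoplot.cdf"))
```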
7.3. query section
Here is the spec for the query section, from rfc3986:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
(and here is the rest, for convenience:)
fragment = *( pchar / "/" / "?" )
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
authority = [ userinfo "@" ] host [ ":" port ]
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
          / path-absolute
          / path-rootless
          / path-empty
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose
There should be a one-to-one mapping to an xml tree that configures the DataSource. This way, an xml schema can be used to construct and verify valid URLs.
ftp://nssdcftp.gsfc.nasa.gov/spacecraft_data/omni/omni2_1972.dat?time=field0&timeFormat=$Y+$d+$H&column=field17
<datasource type=".dat" >
  <file>ftp://nssdcftp.gsfc.nasa.gov/spacecraft_data/omni/omni2_1972.dat</file>
  <time>field0</time>
  <timeFormat>$Y $d $H</timeFormat>
  <column>field17</column>
</datasource>
And then .dat is resolved to AsciiTableDataSource elsewhere.
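The one-to-one mapping from URI to xml tree can be sketched with the Python standard library. The function name uri_to_xml is hypothetical, and the sketch assumes every query item becomes one child element named after the parameter.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl


def uri_to_xml(uri):
    """Build the datasource xml tree for a URI (sketch)."""
    resource, _, query = uri.partition("?")
    # The extension, including the period, becomes the type attribute.
    root = ET.Element("datasource", {"type": posixpath.splitext(resource)[1]})
    ET.SubElement(root, "file").text = resource
    # parse_qsl decodes "+" as a space, matching the encoding convention.
    for key, value in parse_qsl(query, keep_blank_values=True):
        ET.SubElement(root, key).text = value
    return root


root = uri_to_xml(
    "ftp://nssdcftp.gsfc.nasa.gov/spacecraft_data/omni/omni2_1972.dat"
    "?time=field0&timeFormat=$Y+$d+$H&column=field17"
)
print(ET.tostring(root, encoding="unicode"))
```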
7.4. fragment
Often people want to grab resources from a server which takes parameters, and in this case a hash # could be used to indicate the parameters are for the server, not the client data source, and the fragment contains the parameters.
This has not been decided, and it has also been proposed that hash be used to delimit filters to be applied after data is loaded.