developer.URI syntax

From autoplot.org

Jump to: navigation, search

Autoplot URIs have caused some controversy. There's not a clear distinction between what part represents the file resource and what represents the data source configuration. In fact, it's the data source that makes this judgement, so clearly some file resources will not be accessible when there are parameter name conflicts. This document tries to formalize the URIs.

The fundamental goal of Autoplot is to allow access to existing data resources. The dataset URI must:

  • identify the existing resource
  • identify the DataSource that can interpret the existing resource
  • minimally configure the DataSource to a point where the existing resource is useful, but also to where a physical parameter is extracted.

They also serve to identify datasets, so they must:

  • be complete, so they may be easily conveyed, such as in an email.
  • be readable by humans
  • be hackable by humans
  • be formed by humans with assistance (e.g. documentation, completion, or GUI editor)

See also http://tsds.org/uri_templates, which is used for aggregation, and a copy of this is uri_templates.

Contents

  1. URIs, not URLs
  2. Schema
  3. Syntax
    1. data source params
  4. Conventions
    1. Subsetting
    2. Physical Parameter Name
    3. reserved Parameter names
      1. timerange=<datumRange>
      2. resolution=<datum>
      3. cadence=<datum>
      4. depend0=<name>
      5. rank2[=start,stop]
      6. skip<RecordType>=<count>
      7. first<RecordType>=<int>
      8. recCount=<int>
      9. where(condition)
      10. near(time,2009-10-29T15:02)
      11. validMin,validMax
      12. fillValue
      13. arg_0,arg_1,arg_2,...
    4. URI Encoding
      1. colloquial URIs
        1. Test URIS
      2. Resources
  5. Template specs
    1. Automatic GUI creation
    2. Consider Regular Expressions with named groups
    3. examples of templates
    4. reserved names for templates
      1. reserved parameter names
  6. Use Cases
    1. URI Completion
  7. Discussion
    1. URL spec
    2. scheme
    3. query section
    4. fragment

1. URIs, not URLs

Calling them URIs fixes a lot of problems. To call them URLs implies that a web browser should be able to handle them reasonably. Colloquial URIs that look like URLs will be promoted into Autoplot URIs.

2. Schema

All URIs are implicitly "vap", so

 /home/jbf/mydata.qds

is translated to the explicit URI

 vap+qds:file:///home/jbf/mydata.qds.

3. Syntax

 vap[+<dataSourceID>:](http|https|ftp|file)://[<user>[:<password>]@]<auth>/<filepath>/<filename>[.<ext>][;<serverQueryParams>][?<dataSourceParams>][|<filter>]*
 \--- data source ---/\----------------------------------------file resource-------------------------------------------------/\---- data source --/\--filters-/
 \--- data source ---/\----------------------------------------resource URI--------------------------------------------------/\---- data source --/\--filters-/

 alternate being considered that uses hash (#) to delimit the server-side and client-side parts (no hash implies all client-side):
 vap[+<dataSourceID>:](http|https|ftp|file)://[<user>[:<password>]@]<auth>/<filepath>/<filename>[.<ext>][&<serverQueryParams>][#<dataSourceParams>][|<filter>]*
 \--- data source ---/\----------------------------------------file resource-------------------------------------------------/\---- data source --/\--filters-/
 vap[+<dataSourceID>:](http|https|ftp|file)://[<user>[:<password>]@]<auth>/<filepath>/<filename>[.<ext>][&<dataSourceParams>][!<filter>]*


Note when the file resource protocol is http or https, a semicolon is used to delimit the server query params, and the last semicolon should be replaced with a question mark "?" before forming the URL. This is not implemented, and the file cache mechanism will need to be modified.

The URIs may not contain a file resource at all. This shows that ultimately the DataSource must interpret the URI:

 vap+jbdc:mysql://192.168.0.203:3306/temperatures.sql?table=temperatures&depend0=time&temperature

Sometimes the URI contains parameters that are used by the server and the DataSource. Here, the both the OpenDAP server and the DataSource use the parameter "Proton_Density":

 http://cdaweb.gsfc.nasa.gov/cgi-bin/opendap/nph-dods/istp_public/data/genesis/3dl2_gim/2003/genesis_3dl2_gim_20030501_v01.cdf.dds?Proton_Density

It is the data source that knows to pass on the parameter list to the server.

3.1. data source params

We the have conventional structure to the query part of the URI:

 ?Density(10000:20000:1000)&param1=value1&param2=value2
 data_source_params  =  [ ds_name_subset ] [ "&" param_name "=" param_value | switch_name ] *
 ds_name_subset      =  identifier [ "(" start_index ":" [ end_index ] [ ":" stride ] ")" ]
 param_name          =  identifier
 param_value         =  param_value_gen | param_value_typed | param_value_list   ( to allow for parameter specification, we call out typed parameters )
 param_value_gen     =  ( alpha | num | "." | ":" | "+" ) *
 param_value_typed   =  string | int | float | sci
 param_value_list    =  param_value_typed [ "," param_value_typed ] *
 string              =  param_value_gen
 int                 =  [ "+" | "-" ] digit+
 float               =  real numbers, avoiding use of scientific notation
 sci                 =  real numbers, using scientific notation
 identifer           =  alpha [ alpha | num | "_" ] *
 start_index         =  int
 end_index           =  int
 stride              =  digit+    (python supports negative stride, but no data sources do.)
 

Note existing code uses brackets "[" "]" in subset, but this is discouraged in the URI spec http://www.ietf.org/rfc/rfc2396.txt

4. Conventions

4.1. Subsetting

<parameter>[start:stop:stride] is used to specify dataset subsetting. Subsetting is handled by the data source. start is the first index, and is zero if not specified. stop is the non-inclusive last index, and is the length if not specified. Negative indices specify index with respect to the last element. This may not be uniformly supported as it requires the number of elements is known. stride is the index increment, and is 1 if not specified. See Python conventions for array indexing.

 5:       all but the first five elements.
 5:20     15 elements, the first element is 5.
 0:100:5  elements 0,5,10,...,and 95.
 :        all elements
 ::5      every fifth element
 -5:      the last five elements.

Note some data sources don't follow this convention, such as the OpenDAP data source. In this case, the existing convention for OpenDAP is used. Data Sources must clearly document the convention used.

4.2. Physical Parameter Name

An unnamed parameter in the dataSourceParams part should identify the primary parameter to plot.

  vap:file:///data/params.cdf?param1

When a parameter name is used, the dataset returned ought to be named using the parameter name. The parameter names should be C-style identifiers, and when the existing resource contains a name that is not conforming, then a mapping should be used. For example, an excel worksheet contains a column "Phase Angle." This column may be identified as "Phase_Angle" and the DataSource. The name should be formed by converting spaces, slashes, and pluses to underscores; prefixing ids that start with a number with an underscore; and removing other characters.

 Phase Angle -> Phase_Angle
 1 -> _1
 L-Shell -> LShell
 V, km/s -> V_km_s

When a label contains parentheses, these should be interpreted as providing a context for the data. In this case, the parenthesis contents should be mapped as with the label name, when the context cannot be identified, and then the first parenthesis is replaced with an underscore.

 Bx(gsm) -> Bx_gsm
 Speed(km/s)->Speed UNITS=km_s

Note that droplist should use the labels when presenting options, but then use these rules to create valid names (identifiers).

4.3. reserved Parameter names

Here is a set of parameter names to promote consistency across data sources. This should not imply that all data sources accept these parameters. However if such functionality is to be supported, then this is how the parameter should be specified.

4.3.1. timerange=<datumRange>

Time range subset.

4.3.2. resolution=<datum>

Resolution parameter that implies that averaging is be done to reduce data.

4.3.3. cadence=<datum>

Subsetting parameter that implies that subsampling is be done to reduce data.

4.3.4. depend0=<name>

name of the physical parameter to use as the independent parameter for the primary physical parameter.

4.3.5. rank2[=start,stop]

create the physical parameter by trimming a rank 2 table. For example an ascii table or excel worksheet is inherently rank 2, the first dimension are the records and the second dimension is the fields. Typically the parameter is extracted by slicing the second dimension to get a column. Instead, trim the rank 2 table to create a new rank2 table that is the result.

4.3.6. skip<RecordType>=<count>

skip this number of records in the medium. (e.g. Lines in an ascii file.)

4.3.7. first<RecordType>=<int>

use this record in the medium as the first record. (Row number in an excel spreadsheet.)

4.3.8. recCount=<int>

This parameter is redundant since the subsetting notation provides the same functionality, but it's often easier to specify the limit rather than form the end from the start:

 data[25:150]
 data[50:175]

vs

 data[25:]&recCount=125
 data[50:]&recCount=125

4.3.9. where(condition)

subset with condition. Condition contains gt,lt,=,',and,or,etc to constrain search. (e.g. constrain in sql query.)

 vap:sql:jbdc:mysql://192.168.0.203:3306/temperatures.sql?table=temperatures&temperature&depend0=time&where(temperature.lt(32.))

The parameter names in the condition must match the dataset names.

Note this is analogous to the trim operation when the condition is a MONOTONIC DEPEND dataset.

Consider time.within('2009-01-01') as an alternative to time.ge('2009-01-01T00:00').and(time.lt('2009-01-02T00:00')). LANL team has asked for energy ranges (&energy.within('30to200')). (not implemented)

4.3.10. near(time,2009-10-29T15:02)

slice at the index closest to the condition. The first parameter identifies a dataset and should be a DEPEND of the dataset. The UNIT property of this dataset is used to parse the second parameter value, and the index of the closest, relevant value is used.

(not implemented)

4.3.11. validMin,validMax

4.3.12. fillValue

4.3.13. arg_0,arg_1,arg_2,...

arg_<N> refers to the to Nth positional parameter, and shouldn't be used for a parameter name.

4.4. URI Encoding

+ should be used to encode spaces, rather than %20.  %2B should be used to encode +. Spaces occur in these URIs more often, and this improves readability.

4.4.1. colloquial URIs

  • these are Strings that can be converted into URIs.
  • spaces in file names are converted into %20.
  • spaces in parameter lists are converted into pluses.
  • pluses in parameter lists are converted into %2B.
  • note that if there are pluses but the URI is valid, then pluses may be left alone.

4.4.1.1. Test URIS

  • vap:file:///home/jbf/Downloads/CRRES_LOID_1991172000000_V1.cdf?3_He++%20Density
  • vap:file:///home/jbf/Downloads/CRRES_LOID_1991172000000_V1.cdf?3_He++ Density
  • vap:file:///home/jbf/Downloads/CRRES_LOID_1991172000000_V1.cdf?3_He%2B%2B%20Density

4.4.2. Resources

5. Template specs

Currently percents are used to identify template fields, such as timeFormat=%Y+%j+%H in ascii table parser. When these are percent-encoded, they are difficult to read. Therefore, the convention should be to use dollar sign ($) instead of percent, which is valid in a URI:

 timeFormat=$Y+$J+$H

Also, a pair of parenthesis following the dollar sign (e.g. $(H;cadence=6)) should be used to further add template configuration.

 /home/jbf/data/data$Y$J$(H;cadence=6).dat
 /home/jbf/data/cluster$(scid;enum=1,2,3,4).dat

5.1. Automatic GUI creation

It's important to establish norms for URI templates to allow for future features such as automatic GUI creation. Autoplot 2012b has the first implementation of automatic GUI creation, based on it's completions. Another feature that could come out of this is automated testing, where a script could be exercised through a generated list of known valid inputs.

5.2. Consider Regular Expressions with named groups

It looks like Java's regular expressions do not support named groups, but it should be easy to develop. According to http://www.petefreitag.com/item/281.cfm, Python, PHP and .NET support named groups, so there should be good conventions to follow.

5.3. examples of templates

$(type=int;validmin=0;validmax=4)
$(version;format='%4.2f')
$(type=string,name=scid;enum=c1,c2,c3,c4,format='%2s',regex='c[1-4]')

/home/jbf/data/data.bin?encoding=$(enum=pcm,alaw)&scale=(float;name=scale;when(enum.eq(pcm)))

5.4. reserved names for templates

time fields (see unix date command): Y,y,m,d,j,H,M,S,ms,us,ps,milliseconds,microseconds,N,nanoseconds,picoseconds

month names: b,B

am/pm: p,P

versioning: v, V, version (note deviation from date format.)

time zone: z, Z

locale-specific time: x, X (note X is skip/ignore elsewhere. Suggest ${date} for locale specific-time, and discourage use of this field.)

ignore / skip: ignore file_$Y_$(ignore).dat (note X is used elsewhere: file_$Y_$4X.dat)

5.4.1. reserved parameter names

cadence

validmin, validmax

enum

type (integer or real)

when: reserved to indicate when optional fields are required.

label: used to apply a label to a data source that doesn't support labels

units: used to apply a units to a data source that doesn't support units

6. Use Cases

6.1. URI Completion

We'd like to allow scheme-aware URI completion. For example, triggering completion on "vap+" would list the plug-in extensions Autoplot knows about. Using "<C>" to indicate where the trigger is invoked, here are other examples:

 vap+<C>     shows a list of known datasource identifiers (extensions).
 vap:jdbc:<C>  shows a list of known jbdc plugins.
 vap:jbdc:mysql://192.168.0.203:3306/<C> shows a list of databases.
 /<C>        shows a list of local files in the root directory.

7. Discussion

We wish to uniquely identify datasets with URIs.

The URL is identified to a DataSource, and then the DataSource is used to produce the dataset. Presently the URL's extension is used to identify the DataSource type, or when a mime type is available, the mime type may be used. Example DataSource types include AsciiTableDataSource (.dat), NetCDFDataSource (.nc), and CdfDataSource (.cdf).

DataSources may advertise additional capabilities, for example the ability to provide views of a long time series. Clients using this DataSource may use this interface to get more data from the DataSource.

7.1. URL spec

See http://gbiv.com/protocols/uri/rfc/rfc3986.html.

7.2. scheme

In general, the scheme and heir part work as they would any web browser. They specify a resource that can be accessed and interpreted once the format is known.

Scheme may be file, http, or ftp. In the future https and sftp may be supported as well.

LEGACY, WILL BE DROPPED: When a period is found in the scheme, the prefix before the dash is used to specify the data source type as as with an extension. For example,

 bin.http://www.autoplot.org/data/autoplot.cdf

uses the BinaryDataSource (.bin) to load the resource rather than CdfDataSource (.cdf).

When the URI is qualified with "vap+<ext>:", the data source type is explicitly specified. For example,

 vap+bin:http://www.autoplot.org/data/autoplot.cdf

uses the BinaryDataSource (.bin) to load the resource rather than CdfDataSource (.cdf).

7.3. query section

Here is the spec for the query section, from rfc3986:

query         = *( pchar / "/" / "?" )

pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

pct-encoded   = "%" HEXDIG HEXDIG

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"

sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

(and here is the rest, for convenience:)

fragment      = *( pchar / "/" / "?" )

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

authority   = [ userinfo "@" ] host [ ":" port ]

URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

hier-part   = "//" authority path-abempty
            / path-absolute
            / path-rootless
            / path-empty

scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

        foo://example.com:8042/over/there?name=ferret#nose
        \_/   \______________/\_________/ \_________/ \__/
         |           |            |            |        |
      scheme     authority       path        query   fragment
         |   _____________________|__
        / \ /                        \
        urn:example:animal:ferret:nose

There should be a one-to-one mapping to an xml tree that configures the DataSource. This way, an xml schema can be used to construct and verify valid URLs.

ftp://nssdcftp.gsfc.nasa.gov/spacecraft_data/omni/omni2_1972.dat?time=field0&timeFormat=$Y+$d+$H&column=field17

<datasource type=".dat" >
  <file>ftp://nssdcftp.gsfc.nasa.gov/spacecraft_data/omni/omni2_1972.dat</file>
  <time>field0</time>
  <timeFormat>$Y $d $H</timeFormat>
  <column>field17</column>
</datasource>

And then .dat is resolved to AsciiTableDataSource elsewhere.

7.4. fragment

Often people want to grab resources from a server which takes parameters, and in this case a hash # could be used to indicate the parameters are for the server, not the client data source, and the fragment contains the parameters.

This has not been decided, and it has also been proposed that hash be used to delimit filters to be applied after data is loaded.

Personal tools