Squirrel Tool - dataset inspection and management

The squirrel command line tool is a front-end to the Squirrel data access infrastructure. It offers functionality to

  • inspect various aspects of a data collection.

  • pre-scan / index file collections.

  • download data from online sources (FDSN web services, earthquake catalogs).

  • manage separate (isolated, local) environments for different projects.

  • manage persistent selections to speed up access to very large datasets.

Command reference

Help

The squirrel tool and its subcommands are self-documenting with the --help option. Run squirrel without any options to get the list of available subcommands. Run squirrel SUBCOMMAND --help to get details about a specific subcommand.

Common options

Options shared between subcommands are grouped into three categories:

  • General options include --loglevel to select the program’s verbosity and --progress to control how progress status is indicated. These are provided by all of Squirrel’s subcommands.

  • Data collection options control which files and other data sources should be aggregated to form a dataset. The --add option adds files and directories. Further options are available to include or exclude files by regular expression patterns, to restrict the dataset to selected content kinds (waveform, station, channel, response, event), to create persistent data selections, and more. Finally, the --dataset option allows the dataset to be configured conveniently in a YAML file rather than repeatedly with many command line options. With --dataset it is also possible to add online data sources.

  • Data query options are used to restrict processing/presentation to a subset of a data collection. They have no influence on the data collection itself, only on what is shown. It is possible to query by time interval (--tmin, --tmax, --time), channel/station code pattern (--codes), and content kinds (--kinds).
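The difference between data collection options and data query options can be pictured with a small Python sketch. It is only an illustration and not Squirrel's implementation: the Entry record, the query function, the shell-style pattern matching and the interval-overlap rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one item of a data collection."""
    kind: str     # content kind, e.g. 'waveform' or 'channel'
    codes: str    # e.g. 'GR.BFO..LHZ'
    tmin: float   # start time [s]
    tmax: float   # end time [s]


def query(collection, tmin=None, tmax=None, codes=None, kinds=None):
    """Return the matching entries; the collection itself is not modified."""
    selected = []
    for entry in collection:
        if kinds is not None and entry.kind not in kinds:
            continue
        if codes is not None and not fnmatchcase(entry.codes, codes):
            continue
        if tmin is not None and entry.tmax <= tmin:
            continue  # entry ends before the queried interval starts
        if tmax is not None and entry.tmin >= tmax:
            continue  # entry starts after the queried interval ends
        selected.append(entry)
    return selected


collection = [
    Entry('waveform', 'GR.BFO..LHZ', 0.0, 100.0),
    Entry('waveform', 'GR.BFO..LHN', 0.0, 100.0),
    Entry('channel', 'GR.BFO..LHZ', 0.0, 1000.0),
]

shown = query(
    collection, tmin=50.0, tmax=60.0, codes='*.*.*.??Z', kinds={'waveform'})

print(len(shown), len(collection))  # 1 3
```

Only one entry is shown, while the collection still holds all three. This mirrors the statement that query options affect what is presented, not what the dataset contains.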

Tutorial

Downloading data

We first create a local Squirrel environment, so that all the downloaded files as well as the database are stored in the current directory under .squirrel/. This will make it easier to clean up when we are done (rm -rf .squirrel/). If we omit this step, the user’s global Squirrel environment (~/.pyrocko/cache/squirrel/) is used.

Create local environment (optional):

$ squirrel init
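The choice between the local and the global environment can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the environment_path function is hypothetical, and it checks only the given directory, whereas the real tool may apply additional lookup rules.

```python
import tempfile
from pathlib import Path


def environment_path(cwd, home):
    """Use the local environment if it exists, else the global one."""
    local = Path(cwd) / '.squirrel'
    if local.is_dir():
        return local
    return Path(home) / '.pyrocko' / 'cache' / 'squirrel'


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / 'project'
    home = Path(tmp) / 'home'
    project.mkdir()

    before = environment_path(project, home)

    # Stand-in for running 'squirrel init' inside the project directory.
    (project / '.squirrel').mkdir()

    after = environment_path(project, home)

    print(before.relative_to(tmp))
    print(after.relative_to(tmp))
```

Before the local directory exists, the sketch falls back to the global location. Afterwards it picks the project's own .squirrel/ directory.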

To use a remote data source, we can create a dataset description file and pass it to the --dataset option of the various squirrel subcommands. Examples of such dataset description files are provided by the squirrel template command. Conveniently, there is already an example for accessing all LH channels from BGR’s FDSN web service. We can save the example dataset description file with

$ squirrel template bgr-gr-lh.dataset -w
squirrel:psq.cli.template - INFO - File written: bgr-gr-lh.dataset.yaml

The dataset description is a nicely commented YAML file and we could modify it to our liking.

bgr-gr-lh.dataset.yaml
--- !squirrel.Dataset

# All file paths given below are treated relative to the location of this
# configuration file. Here we may give a common prefix. For example, if the
# configuration file is in the sub-directory 'PROJECT/config/', set it to '..'
# so that all paths are relative to 'PROJECT/'.
path_prefix: '.'

# Data sources to be added (LocalData, FDSNSource, CatalogSource, ...)
sources:
- !squirrel.FDSNSource

  # URL or alias of FDSN site.
  site: bgr

  # Uncomment to let metadata expire in 10 days:
  #expires: 10d

  # Waveforms can be optionally shared with other FDSN client configurations,
  # so that data is not downloaded multiple times. The downside may be that in
  # some cases more data than expected is available (if data was previously
  # downloaded for a different application).
  #shared_waveforms: true

  # FDSN query arguments to make metadata queries.
  # See http://www.fdsn.org/webservices/fdsnws-station-1.1.pdf
  # Time span arguments should not be added here, because they are handled
  # automatically by Squirrel.
  query_args:
    network: 'GR'
    channel: 'LH?'

Expert users can get a non-commented version of the file by adding --format brief to the squirrel template command.
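Because the dataset description is plain text, it can also be generated by a script when several similar configurations are needed. The following Python sketch renders a file with the same keys as the example above. The render_dataset helper is hypothetical, it does no YAML escaping, and it is not claimed to reproduce the exact output of squirrel template.

```python
TEMPLATE = """--- !squirrel.Dataset
path_prefix: '.'
sources:
- !squirrel.FDSNSource
  site: {site}
  query_args:
{query_args}
"""


def render_dataset(site, query_args):
    """Render a dataset description; values must not contain quotes."""
    lines = [f"    {key}: '{value}'" for key, value in query_args.items()]
    return TEMPLATE.format(site=site, query_args='\n'.join(lines))


text = render_dataset('bgr', {'network': 'GR', 'channel': 'LH?'})
print(text)
```

Writing the returned text to a file such as bgr-gr-lh.dataset.yaml would give a compact equivalent of the commented example, assuming the keys keep the meaning described in its comments.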

Now we tell squirrel to update the metadata for the time interval of interest.

$ squirrel update --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01
[...]
squirrel update:psq.client.fdsn           - INFO     - FDSN "bgr" metadata: querying...
squirrel update:psq.client.fdsn           - INFO     - FDSN "bgr" metadata: new (expires: never)
[...]
squirrel update:psq.cli.update            - INFO     - Squirrel stats:
  Number of files:               2
  Total size of known files:     87 kB
  Number of index nuts:          160
  Available content kinds:
    channel: 120 1991-09-01 00:00:00.000 - <none>
    station: 40  <none>                  - <none>
  Available codes:
    GR.AHRW..LHE GR.AHRW..LHN GR.AHRW..LHZ GR.AHRW.*    GR.ASSE..LHE GR.ASSE..LHN
    GR.ASSE..LHZ GR.ASSE.*    GR.BFO..LHE  GR.BFO..LHN
    [140 more]
    GR.UBR..LHZ  GR.UBR.*     GR.WET..LHE  GR.WET..LHN  GR.WET..LHZ  GR.WET.*
    GR.ZARR..LHE GR.ZARR..LHN GR.ZARR..LHZ GR.ZARR.*
  Sources:
    client:fdsn:b3ad21f2a866c178889cfdf4f493eba588a59543
  Operators:                     <none>

After fetching the metadata from the FDSN web service, a brief overview of the contents currently known to Squirrel is printed.

If we run the update command a second time, Squirrel informs us that cached metadata has been used:

$ squirrel update --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01
[...]
squirrel update:psq.client.fdsn           - INFO     - FDSN "bgr" metadata: using cached (expires: never)
[...]

It is possible to set an expiration period for the metadata in the dataset configuration, for example with the expires entry that is commented out in the file shown above.
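The effect of such an expiration setting can be sketched in Python. This is a minimal illustration, not Squirrel's code: parse_expires and needs_refresh are hypothetical helpers, and only the day suffix seen in the example ('10d') is handled.

```python
from datetime import datetime, timedelta, timezone


def parse_expires(text):
    """Parse a duration like '10d' (days); other units are not handled."""
    if not text.endswith('d'):
        raise ValueError(f'unsupported duration: {text!r}')
    return timedelta(days=float(text[:-1]))


def needs_refresh(fetched_at, expires, now):
    """Decide whether cached metadata should be queried again."""
    if expires is None:
        return False  # corresponds to "expires: never" in the log output
    return now - fetched_at >= parse_expires(expires)


fetched = datetime(2021, 8, 1, tzinfo=timezone.utc)

print(needs_refresh(fetched, None, fetched + timedelta(days=365)))  # False
print(needs_refresh(fetched, '10d', fetched + timedelta(days=3)))   # False
print(needs_refresh(fetched, '10d', fetched + timedelta(days=11)))  # True
```

Without an expiration setting the cached metadata is reused indefinitely, which matches the "using cached (expires: never)" message above.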

Next, we must give Squirrel permission to download data under certain constraints. Squirrel will only download waveform data when it has a so-called promise for a given time span and channel. These promises must be explicitly created with the --promises option of squirrel update. We are only interested in vertical component seismograms at this point, so we restrict promise creation to channels ending in ‘Z’.

$ squirrel update --promises --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01 --codes '*.*.*.??Z'
[...]
  Available content kinds:
    channel:          120 1991-09-01 00:00:00.000 - <none>
    station:          40  <none>                  - <none>
    waveform_promise: 40  2021-07-28 00:00:00.000 - 2021-08-01 00:00:00.000
[...]
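To see which codes a pattern like '*.*.*.??Z' selects, we can try it on a few of the codes listed earlier. The Python sketch below uses shell-style matching from the standard library. Whether Squirrel applies exactly the same matching rules is an assumption here.

```python
from fnmatch import fnmatchcase

# A few codes taken from the "Available codes" listing above.
codes = [
    'GR.WET..LHE',
    'GR.WET..LHN',
    'GR.WET..LHZ',
    'GR.WET.*',
    'GR.ZARR..LHZ',
]

pattern = '*.*.*.??Z'

matches = [c for c in codes if fnmatchcase(c, pattern)]
print(matches)  # ['GR.WET..LHZ', 'GR.ZARR..LHZ']
```

Only the vertical channels remain, one per station. This is consistent with the 40 waveform promises reported for 40 stations.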

To actually download the waveforms, we can now use the squirrel summon command.

$ squirrel summon --dataset bgr-gr-lh.dataset.yaml --tmin 2021-07-28 --tmax 2021-08-01

Finally we can have a look at the data.

$ squirrel snuffler --dataset bgr-gr-lh.dataset.yaml

[Figure: squirrel_tutorial1.png, Snuffler screenshot showing the full time window]

Waveforms are always downloaded in blocks of reasonable size, so the downloaded time frame may be slightly larger than the requested time span.
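Why block-wise downloading enlarges the time frame can be illustrated with a short Python sketch. The block length of six hours and the alignment of blocks to the Unix epoch are assumptions chosen for this example. The block sizes actually used by Squirrel are not stated here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def snap_to_blocks(tmin, tmax, block_length):
    """Expand [tmin, tmax] outward to whole blocks counted from the epoch."""
    first = (tmin - EPOCH) // block_length       # floor division
    last = -((EPOCH - tmax) // block_length)     # ceiling division
    return EPOCH + first * block_length, EPOCH + last * block_length


# Hypothetical block length, for illustration only.
block_length = timedelta(hours=6)

requested_tmin = datetime(2021, 7, 28, 3, 0, tzinfo=timezone.utc)
requested_tmax = datetime(2021, 7, 28, 14, 0, tzinfo=timezone.utc)

block_tmin, block_tmax = snap_to_blocks(
    requested_tmin, requested_tmax, block_length)

print(block_tmin.isoformat())  # 2021-07-28T00:00:00+00:00
print(block_tmax.isoformat())  # 2021-07-28T18:00:00+00:00
```

A request from 03:00 to 14:00 is widened to 00:00 to 18:00 in this sketch, because only whole blocks are fetched.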

[Figure: squirrel_tutorial2.png, Snuffler screenshot of the waveforms. Caption: M8.2 Alaska earthquake.]

Dataset conversion

So far, the data has been downloaded into a special cache directory maintained by Squirrel. Using the data from there is convenient if we later add more waveforms, but sometimes we want to create our own waveform archive in a portable form.

To copy the data downloaded in the previous section into a handy directory structure, we can use the squirrel jackseis command. With its --out-sds-path option, a standard SDS data directory with day files in MSEED format is created.

$ squirrel jackseis --dataset bgr-gr-lh.dataset.yaml --out-sds-path data/sds
$ tree data/
data/
└── sds
    └── 2021
        └── GR
            ├── BFO
            │   └── LHZ.D
            │       ├── GR.BFO..LHZ.D.2021.208
            │       ├── GR.BFO..LHZ.D.2021.209
            │       ├── GR.BFO..LHZ.D.2021.210
            │       ├── GR.BFO..LHZ.D.2021.211
            │       ├── GR.BFO..LHZ.D.2021.212
            │       └── GR.BFO..LHZ.D.2021.213
            ├── ...
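The file names in this listing follow the SDS convention YEAR/NET/STA/CHAN.TYPE/NET.STA.LOC.CHAN.TYPE.YEAR.DAY, where DAY is the three-digit day of the year. The Python sketch below reconstructs such a path. The sds_path helper is hypothetical and is not part of Squirrel.

```python
from datetime import date


def sds_path(net, sta, loc, cha, day, data_type='D'):
    """Build the relative SDS path of the day file for one channel."""
    year = day.year
    doy = day.timetuple().tm_yday  # day of year, 1-based
    filename = f'{net}.{sta}.{loc}.{cha}.{data_type}.{year}.{doy:03d}'
    return f'{year}/{net}/{sta}/{cha}.{data_type}/{filename}'


print(sds_path('GR', 'BFO', '', 'LHZ', date(2021, 7, 28)))
# 2021/GR/BFO/LHZ.D/GR.BFO..LHZ.D.2021.209

print(sds_path('GR', 'BFO', '', 'LHZ', date(2021, 8, 1)))
# 2021/GR/BFO/LHZ.D/GR.BFO..LHZ.D.2021.213
```

Day 209 corresponds to 28 July 2021 and day 213 to 1 August 2021. The file for day 208 (27 July) in the listing lies just before the requested time span, which fits the block-wise downloading described in the previous section.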

We will use this dataset as a “local dataset” in the following sections.

Local datasets

Dataset inspection

P