Tutorial: writing a Squirrel-based tool to calculate hourly RMS values

In this tutorial we will write a user-friendly command line interface (CLI) program to calculate hourly root-mean-square (RMS) values for continuous seismic waveform recordings. Many methods in seismology process waveform recordings in similar ways, so this example can serve as a design pattern for tools that create daily spectrograms, PSD plots, drum plots, earthquake detectors, automatic pickers, and so on. We will show different styles of writing such a script/app, each useful in slightly different situations. If you use the provided building blocks and follow the conventions laid out below, your app will immediately feel familiar to other Squirrel users. It will ship with a bunch of nice features, like support for extra large datasets, and it will be a pleasure to maintain.

Specs for the tutorial app:

  • Calculate hourly RMS values.

  • Get data from local file collection or from online FDSN web services.

  • Keep and reuse downloaded raw data (don’t download same data twice).

  • Process a given time span.

  • Station/channel selection.

  • Efficiently hop through the data in chunks of one hour.

  • Always start chunks at full hours.

  • Deconvolve instrument response before calculating RMS.

  • Suppress processing artifacts by using overlapping time windows.

Variant 1: quick and dirty script

Feel like hard-coding today? Sometimes we do not need any sophisticated CLI or configuration files.

Here’s a commented solution to the RMS calculation task. Study it carefully: there is a lot of functionality packed into this short script.

import numpy as np
from pyrocko import progress
from pyrocko.util import setup_logging, time_to_str, str_to_time_fillup as stt
from pyrocko.squirrel import Squirrel

# Enable logging and progress bars:
setup_logging('squirrel_rms1.py', 'info')
progress.set_default_viewer('terminal')   # or 'log' to just log progress


def rms(data):
    return np.sqrt(np.sum(data**2))


# Time span (Hunga Tonga explosion was on 2022-01-15):
tmin = stt('2022-01-14')
tmax = stt('2022-01-17')

# Frequency band for restitution:
fmin = 0.01
fmax = 0.05

# All data access will happen through a single Squirrel instance:
sq = Squirrel()

# Uncomment to add local data directories. This will only index the waveform
# headers and not load the actual waveform data at this point. If the files are
# already known and unmodified, it will use cached information.
# sq.add('data/2022')

# Add online data source. Enable access to BGR's FDSN web service and restrict
# to GR network and LH channels only:
sq.add_fdsn('bgr', query_args=dict(network='GR', channel='LH?'))

# Ensure meta-data from online sources is up to date for the selected time
# span:
sq.update(tmin=tmin, tmax=tmax)

# Allow waveform download for station BFO. This does not download anything
# yet, it just enables downloads later, when needed. Omit `tmin`, `tmax`,
# and `codes` to enable download of everything selected in `add_fdsn(...)`.
sq.update_waveform_promises(tmin=tmin, tmax=tmax, codes='*.BFO.*.*')

# Make sure instrument response information is available for the selected data.
sq.update_responses(tmin=tmin, tmax=tmax, codes='*.BFO.*.*')

# Time length for padding (half of overlap).
tpad = 1.0 / fmin

# Iterate window-wise, with some overlap over the data:
for batch in sq.chopper_waveforms(
        tmin=tmin,
        tmax=tmax,
        codes='*.*.*.LHZ',
        tinc=3600.,              # One hour time windows (payload).
        tpad=tpad,               # Add padding to absorb processing artifacts.
        want_incomplete=False,   # Skip incomplete windows.
        snap_window=True):       # Start all windows at full hours.

    for tr in batch.traces:
        resp = sq.get_response(tr).get_effective()

        # Restitution via spectral division. This will introduce artifacts
        # in the padding area in the beginning and end of the trace.
        tr = tr.transfer(
            tpad,                  # Fade in / fade out length.
            (0.5*fmin, fmin,
             fmax, 2.0*fmax),      # Frequency taper.
            resp,                  # Complex frequency response of instrument.
            invert=True,           # Use inverse of instrument response.
            cut_off_fading=False)  # Disable internal trimming.

        # Cut off the contaminated padding. Trim to the payload interval.
        tr.chop(batch.tmin, batch.tmax)

        # Print channel codes, timestamp and RMS value of the hour.
        print(tr.str_codes, time_to_str(batch.tmin), rms(tr.ydata))

Note

In the code above, we have shown three different ways of restricting the channels to be processed. Each of them narrows down the set of available channels, but they have very different effects (a condensed sketch follows after this list):

  • The query_args argument of sq.add_fdsn() restricts the meta-data known to Squirrel by limiting what is fetched from the FDSN station web service. Using query_args has the benefit of reducing the amount of meta-data to transmit and parse. However, changing it later will result in new queries being made. If it is left unchanged, the locally cached meta-data is reused when the program is rerun.

  • By calling sq.update_waveform_promises() we give Squirrel permission to download certain waveforms by creating so-called waveform promises for matching channels and time intervals. If we call the method without specifying the codes argument, waveforms from all stations/channels are marked as downloadable. Squirrel will not download anything if we do not call sq.update_waveform_promises().

  • The codes argument of sq.chopper_waveforms() selects from the waveforms we have locally and from the waveform promises set up earlier. Downloading will happen just as needed and in chunks of reasonable size.
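
The following sketch condenses these three restriction points into one place. It is not a new variant, just a rearrangement of calls from the script above, using the same time span:

from pyrocko.util import str_to_time_fillup as stt
from pyrocko.squirrel import Squirrel

tmin = stt('2022-01-14')
tmax = stt('2022-01-17')

sq = Squirrel()

# (1) Restrict which meta-data is fetched from the FDSN web service:
sq.add_fdsn('bgr', query_args=dict(network='GR', channel='LH?'))
sq.update(tmin=tmin, tmax=tmax)

# (2) Permit waveform downloads, but only for matching channels and times:
sq.update_waveform_promises(tmin=tmin, tmax=tmax, codes='*.BFO.*.*')

# (3) Select what is actually processed, from local data and promises:
for batch in sq.chopper_waveforms(
        tmin=tmin, tmax=tmax, codes='*.*.*.LHZ', tinc=3600.):
    pass  # ... process batch.traces as in the script above ...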

Before we actually run the script, we will first create a local Squirrel environment, so that all downloaded files as well as the database are stored in the current directory under .squirrel/. This makes it easier to clean up when we are done (rm -rf .squirrel/). If we omit this step, the user’s global Squirrel environment (~/.pyrocko/cache/squirrel/) is used.

Create local environment (optional):

$ squirrel init

And now let’s run our script:

$ python squirrel_rms1.py
[...]
squirrel_rms1.py:psq.client.fdsn - INFO - FDSN "bgr" metadata: querying...
squirrel_rms1.py:psq.client.fdsn - INFO - FDSN "bgr" metadata: new (expires: never)
[...]
squirrel_rms1.py:psq.base        - INFO - Waveform orders standing for download: 1 (1)
squirrel_rms1.py:psq.client.fdsn - INFO - FDSN "bgr" waveforms: downloading, 1 order: GR.BFO..LHZ
squirrel_rms1.py:psq.client.fdsn - INFO - FDSN "bgr" waveforms: 1 download successful
[...]
GR.BFO..LHZ. 2022-01-14 00:00:00.000 1663.1710971934713
GR.BFO..LHZ. 2022-01-14 01:00:00.000 1773.5581525847992
GR.BFO..LHZ. 2022-01-14 02:00:00.000 1688.5986175096787
[...]
squirrel_rms1.py:psq.base        - INFO - Waveform orders standing for download: 1 (1)
squirrel_rms1.py:psq.client.fdsn - INFO - FDSN "bgr" waveforms: downloading, 1 order: GR.BFO..LHZ
squirrel_rms1.py:psq.client.fdsn - INFO - FDSN "bgr" waveforms: 1 download successful
GR.BFO..LHZ. 2022-01-14 22:00:00.000 1570.7909549562307
GR.BFO..LHZ. 2022-01-14 23:00:00.000 1595.3630840478215
GR.BFO..LHZ. 2022-01-15 00:00:00.000 1445.7303611595091
[...]

Excellent! It is downloading waveform data and calculating RMS values.

The lines with the RMS values are printed to stdout, while log messages go to stderr. This way, we can, for example, redirect only the RMS results to a file while still watching the log messages in the terminal:

$ python squirrel_rms1.py > rms-GR.BFO..LHZ.txt
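
If we also want to keep the log messages, plain shell redirection of stderr works as well. Nothing here is Squirrel-specific; squirrel_rms1.log is just an arbitrary file name:

$ python squirrel_rms1.py > rms-GR.BFO..LHZ.txt 2> squirrel_rms1.log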

Running the script a second time is way faster, because nothing has to be downloaded.

Not very flexible, though, with all those settings hard-coded in the script. Read on to see how we can configure data access from the command line.

Variant 2: basic CLI app

Instead of hard-coding the data sources in the script, we can set them with command line arguments. The pyrocko.squirrel.tool module offers functionality to set up a program so that it accepts the same options and arguments as, for example, squirrel scan. Here’s the complete program after changing it to use SquirrelArgumentParser:

squirrel_rms2.py: notable differences from Variant 1 are highlighted.
#!/usr/bin/env python3
import numpy as np
from pyrocko.util import time_to_str, str_to_time_fillup as stt
from pyrocko import squirrel


def rms(data):
    return np.sqrt(np.sum(data**2))


tmin = stt('2022-01-14')
tmax = stt('2022-01-17')

fmin = 0.01
fmax = 0.05

# Setup Squirrel instance with waveforms from files given on command line.
# Supports all common options offered by Squirrel tool commands like `squirrel
# scan`:  --help, --loglevel, --progress, --add, --include, --exclude,
# --optimistic, --format, --kind, --persistent, --update, --dataset

parser = squirrel.SquirrelArgumentParser(
    description='Report hourly RMS values.')

parser.add_squirrel_selection_arguments()
args = parser.parse_args()
sq = args.make_squirrel()

sq.update(tmin=tmin, tmax=tmax)
sq.update_waveform_promises(tmin=tmin, tmax=tmax, codes='*.BFO.*.*')
sq.update_responses(tmin=tmin, tmax=tmax, codes='*.BFO.*.*')

tpad = 1.0 / fmin

for batch in sq.chopper_waveforms(
        tmin=tmin,
        tmax=tmax,
        codes='*.*.*.LHZ',
        tinc=3600.,
        tpad=tpad,
        want_incomplete=False,
        snap_window=True):

    for tr in batch.traces:
        resp = sq.get_response(tr).get_effective()
        tr = tr.transfer(
            tpad, (0.5*fmin, fmin, fmax, 2.0*fmax), resp, invert=True,
            cut_off_fading=False)

        tr.chop(batch.tmin, batch.tmax)
        print(tr.str_codes, time_to_str(batch.tmin), rms(tr.ydata))

SquirrelArgumentParser inherits from the Python Standard Library’s argparse.ArgumentParser but has a few extra features useful when working with squirrels.
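
Because it behaves like a regular argparse.ArgumentParser, custom options can be mixed freely with the Squirrel selection arguments. A minimal sketch, reusing only calls that appear elsewhere in this tutorial:

from pyrocko import squirrel

parser = squirrel.SquirrelArgumentParser(
    description='Report hourly RMS values.')

# Squirrel's standard data collection options (--add, --dataset, ...):
parser.add_squirrel_selection_arguments()

# Any regular argparse option can be added alongside:
parser.add_argument(
    '--fmin',
    dest='fmin',
    metavar='FLOAT',
    default=0.01,
    type=float,
    help='Corner of highpass [Hz].')

args = parser.parse_args()
sq = args.make_squirrel()  # Squirrel instance configured from the command line.
# sq is now ready for sq.update(...), sq.chopper_waveforms(...), etc.
print('Using fmin =', args.fmin)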

Note

It is also possible to add Squirrel’s standard CLI options to the standard argparse.ArgumentParser. This may be useful when extending an existing app. An example is provided in squirrel_rms4.py.
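
A rough sketch of that approach, for orientation only: the module-level helpers add_squirrel_selection_arguments() and squirrel_from_selection_arguments() used here are assumptions on our part, so consult squirrel_rms4.py for the exact calls:

import argparse
from pyrocko import squirrel

# An existing, plain argparse-based CLI:
parser = argparse.ArgumentParser(description='Report hourly RMS values.')
parser.add_argument('--fmin', type=float, default=0.01)

# Assumed helper to bolt Squirrel's data collection options onto the parser:
squirrel.add_squirrel_selection_arguments(parser)

args = parser.parse_args()

# Assumed helper to build the Squirrel instance from the parsed arguments:
sq = squirrel.squirrel_from_selection_arguments(args)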

To get RMS values of some local data in the directory data/2022, we could now run

$ python squirrel_rms2.py --add data/2022

The tool is also self-documenting (--help):

$ python squirrel_rms2.py --help
usage: squirrel_rms2.py [--help] [--loglevel LEVEL] [--progress DEST]
                    [--add PATH [PATH ...]] [--include REGEX]
                    [--exclude REGEX] [--optimistic] [--format FORMAT]
                    [--add-only KINDS] [--persistent NAME] [--update]
                    [--dataset FILE]

Report hourly RMS values.

General options:
  --help, -h            Show this help message and exit.
  --loglevel LEVEL      Set logger level. Choices: critical, error, warning,
                        info, debug. Default: info.
  --progress DEST       Set how progress status is reported. Choices: terminal,
                        log, off. Default: terminal.

Data collection options:
  --add PATH [PATH ...], -a PATH [PATH ...]
                        Add files and directories with waveforms, metadata and
                        events. Content is indexed and added to the temporary
                        (default) or persistent (see --persistent) data
                        selection.

[...]

  --dataset FILE, -d FILE
                        Add files, directories and remote sources from dataset
                        description file. This option can be repeated to add
                        multiple datasets. Run `squirrel template` to obtain
                        examples of dataset description files.

So, to use a remote data source, we can create a dataset description file and pass it to --dataset. Examples of such dataset description files are provided by the squirrel template command. As it happens, there already is an example for accessing all LH channels from BGR’s FDSN web service! We can save this example dataset description file with

$ squirrel template bgr-gr-lh.dataset -w
squirrel:psq.cli.template - INFO - File written: bgr-gr-lh.dataset.yaml

The dataset description is a nicely commented YAML file and we could modify it to our liking.

bgr-gr-lh.dataset.yaml
--- !squirrel.Dataset

# All file paths given below are treated relative to the location of this
# configuration file. Here we may give a common prefix. For example, if the
# configuration file is in the sub-directory 'PROJECT/config/', set it to '..'
# so that all paths are relative to 'PROJECT/'.
path_prefix: '.'

# Data sources to be added (LocalData, FDSNSource, CatalogSource, ...)
sources:
- !squirrel.FDSNSource

  # URL or alias of FDSN site.
  site: bgr

  # FDSN query arguments to make metadata queries.
  # See http://www.fdsn.org/webservices/fdsnws-station-1.1.pdf
  # Time span arguments should not be added here, because they are handled
  # automatically by Squirrel.
  query_args:
    network: 'GR'
    channel: 'LH?'

Expert users can get a non-commented version of the file by adding --format brief to the squirrel template command.
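
For example, assuming the option combines with the command shown above (without -w, the template should be printed to the terminal instead of written to a file):

$ squirrel template bgr-gr-lh.dataset --format brief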

To calculate RMS values for the configured dataset, we can now run

$ python squirrel_rms2.py --dataset bgr-gr-lh.dataset.yaml
[...]
GR.BFO..LHZ. 2022-01-14 00:00:00.000 1663.1710971934713
GR.BFO..LHZ. 2022-01-14 01:00:00.000 1773.5581525847992
GR.BFO..LHZ. 2022-01-14 02:00:00.000 1688.5986175096787
[...]

This is a bit more flexible, because we can now easily exchange the data source from the command line. But there is still room for improvement. Read on to see how we can create a nicely structured program supporting multiple subcommands.

Variant 3: structured CLI app

In this next iteration of our example RMS app, we will:

  • Improve the program structure by using SquirrelCommand and squirrel.run. This also makes it possible to catch certain exceptions and report failure conditions in a consistent way.

  • Add options to select the channels and time spans to be processed (--codes, --tmin, --tmax).

  • Add options to select the frequency range (--fmin, --fmax).

  • Support multiple subcommands: ./squirrel_rms3.py rms will report the RMS values just like before, and ./squirrel_rms3.py plot is intended to plot the RMS values as a function of time. The latter is left unimplemented as an exercise for the reader.

For each subcommand we want to support, we create a subclass of SquirrelCommand and override the methods make_subparser(), setup(), and run(). The name and description of the subcommand are configured in make_subparser(). In setup(), we can configure the parser and add custom arguments. run() is called after the command line arguments have been processed. Finally, we put everything together with a single call to squirrel.run, which processes the arguments and dispatches to the selected subcommand’s run(), or prints a help message if no subcommand is given.

Here’s the final implementation of the RMS tool:

#!/usr/bin/env python3
import numpy as np
from pyrocko.util import time_to_str
from pyrocko import squirrel


def rms(data):
    return np.sqrt(np.sum(data**2))


class ReportRMSTool(squirrel.SquirrelCommand):

    def make_subparser(self, subparsers):
        return subparsers.add_parser(
            'rms', help='Report hourly RMS values.')

    def setup(self, parser):
        parser.add_squirrel_selection_arguments()

        # Add '--codes', '--tmin' and '--tmax', but not '--time'.
        parser.add_squirrel_query_arguments(without=['time'])

        parser.add_argument(
            '--fmin',
            dest='fmin',
            metavar='FLOAT',
            default=0.01,
            type=float,
            help='Corner of highpass [Hz].')

        parser.add_argument(
            '--fmax',
            dest='fmax',
            metavar='FLOAT',
            default=0.05,
            type=float,
            help='Corner of lowpass [Hz].')

    def run(self, parser, args):
        sq = args.make_squirrel()

        fmin = args.fmin
        fmax = args.fmax

        query = args.squirrel_query
        sq.update(**query)
        sq.update_waveform_promises(**query)
        sq.update_responses(**query)

        tpad = 1.0 / fmin

        for batch in sq.chopper_waveforms(
                tinc=3600.,
                tpad=tpad,
                want_incomplete=False,
                snap_window=True,
                **query):

            for tr in batch.traces:
                resp = sq.get_response(tr).get_effective()
                tr = tr.transfer(
                    tpad, (0.5*fmin, fmin, fmax, 2.0*fmax), resp, invert=True,
                    cut_off_fading=False)

                tr.chop(batch.tmin, batch.tmax)
                print(tr.str_codes, time_to_str(batch.tmin), rms(tr.ydata))


class PlotRMSTool(squirrel.SquirrelCommand):

    def make_subparser(self, subparsers):
        return subparsers.add_parser(
            'plot', help='Plot RMS traces.')

    def run(self, parser, args):
        self.fail('Not implemented yet!')


squirrel.run(
    subcommands=[ReportRMSTool(), PlotRMSTool()],
    description='My favourite RMS tools.')

Note

If we do not need multiple subcommands, we can still use the same structure of our program. We can pass a single SquirrelCommand to the command argument of squirrel.run:

squirrel.run(
    command=ReportRMSTool(),
    description='Report hourly RMS values.')

Now we can easily change the time span, station, or channel with command line arguments.

$ python squirrel_rms3.py rms \
    --dataset bgr-gr-lh.dataset.yaml \
    --tmin=2020-01-01 --tmax=2020-01-02 \
    --codes='*.BFO.*.LH?'
GR.BFO..LHE. 2020-01-01 00:00:00.000 46429.75461920944
GR.BFO..LHN. 2020-01-01 00:00:00.000 33135.87662941785
GR.BFO..LHZ. 2020-01-01 00:00:00.000 34431.2620593553
GR.BFO..LHE. 2020-01-01 01:00:00.000 46297.0737952195
GR.BFO..LHN. 2020-01-01 01:00:00.000 34239.18093354454
GR.BFO..LHZ. 2020-01-01 01:00:00.000 32941.96682045564
[...]
$ python squirrel_rms3.py plot
squirrel_rms3.py:psq.tool.common - CRITICAL - Not implemented yet!

And now you can start implementing the plot subcommand!

Summary

In this tutorial we have explored different ways to structure a Squirrel-based continuous waveform processing program and how to make use of Squirrel’s CLI tool helpers. These helpers offer a clean and easy way to configure data sources consistently across multiple seismological applications. They provide standardized program options, allowing users to become familiar with a new application more quickly. Following the suggested structure also greatly simplifies the reuse of existing functionality in new programs.