squirrel.base

class Selection(database, persistent=None)[source]

Bases: object

Database backed file selection (base class for Squirrel).

Parameters:
  • database (Database or str) – Database instance or file path to database.
  • persistent (str) – If given a name, create a persistent selection.

In the Squirrel framework, a selection is conceptually a list of files to be made available in the application. Instead of using Selection directly, user applications should usually use its subclass Squirrel, which adds content indices to the selection and provides high-level data querying.

By default, a temporary table in the database is created to hold the names of the files in the selection. This table is only visible inside the application which created it. If a name is given to persistent, a named selection is created, which is visible also in other applications using the same database.
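The difference between the default temporary table and a persistent selection can be illustrated with plain sqlite3. The following is a minimal sketch, not Squirrel's actual schema: the table names selection_tmp and selection_named and both helper functions are made up for illustration. It only shows that a temporary table is private to the connection which created it, whereas a regular table is visible to every connection on the same database file.

```python
import os
import sqlite3
import tempfile


def table_names(connection):
    """Return names of regular and temporary tables visible to a connection."""
    names = set()
    for schema in ('sqlite_master', 'sqlite_temp_master'):
        rows = connection.execute(
            "SELECT name FROM %s WHERE type = 'table'" % schema)
        names.update(row[0] for row in rows)
    return names


def demo_visibility():
    """Create a temporary and a regular table, then look from a 2nd connection."""
    with tempfile.TemporaryDirectory() as tmpdir:
        db_path = os.path.join(tmpdir, 'demo.sqlite')

        # First "application": creates both kinds of tables.
        app_a = sqlite3.connect(db_path)
        # Hypothetical table names, chosen for this illustration only.
        app_a.execute('CREATE TEMP TABLE selection_tmp (path TEXT)')
        app_a.execute('CREATE TABLE selection_named (path TEXT)')
        app_a.commit()

        # Second "application": opens the same database file.
        app_b = sqlite3.connect(db_path)
        seen_by_a = table_names(app_a)
        seen_by_b = table_names(app_b)
        app_a.close()
        app_b.close()

    return seen_by_a, seen_by_b


seen_by_a, seen_by_b = demo_visibility()
```

Here seen_by_a contains both tables, while seen_by_b contains only the regular one.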

Besides the filename references, desired content kind masks and file format indications are stored in the selection’s database table to make the user choice regarding these options persistent on a per-file basis. Book-keeping on whether files are unknown, known or if modification checks are forced is also handled in the selection’s file-state table.

Paths of files can be added to the selection using the add() method and removed with remove(). undig_grouped() can be used to iterate over all content known to the selection.

get_database()[source]

Get the database to which this selection belongs.

Returns: Database object

add(paths, kind_mask=255, format='detect')[source]

Add files to the selection.

Parameters: paths (iterator yielding str objects) – Paths to files to be added to the selection.

remove(paths)[source]

Remove files from the selection.

Parameters: paths (list of str) – Paths to files to be removed from the selection.

iter_paths()[source]

Iterate over all file paths currently belonging to the selection.

Returns: Iterator yielding file paths.

get_paths()[source]

Get all file paths currently belonging to the selection.

Returns: List of file paths.

undig_grouped(skip_unchanged=False)[source]

Get inventory of cached content for all files in the selection.

Parameters: skip_unchanged – If True, only the inventory of modified files is yielded (flag_modified() must be called beforehand).

This generator yields tuples ((format, path), nuts) where path is the path to the file, format is the format assignation or 'detect' and nuts is a list of Nut objects representing the contents of the file.

flag_modified(check=True)[source]

Mark files which have been modified.

Parameters: check – If True, query modification times of known files on disk. If False, only flag unknown files.

Assumes file state is 0 for newly added files, 1 for files added again to the selection (forces check), or 2 for all others (no checking is done for those).

Sets file state to 0 for unknown or modified files, 2 for known and not modified files.
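The state handling described above can be written down as a small transition function. This is only a sketch of one reading of the description, not the library's code: the function name next_file_state and the modified_on_disk argument (standing in for the result of a modification-time comparison) are hypothetical.

```python
def next_file_state(state, modified_on_disk, check=True):
    """
    Return the file state after a flagging pass (illustrative only).

    state: 0 = newly added (unknown), 1 = added again (check forced),
           2 = any other file (no checking is done).
    modified_on_disk: hypothetical result of comparing the file on disk
           with the modification time stored in the database.
    """
    if state == 0:
        # Unknown files stay flagged for indexing.
        return 0
    if state == 1 and check:
        # Forced check: flag as modified (0) or accept as known (2).
        return 0 if modified_on_disk else 2
    # Known files, or checking disabled: treated as known and unmodified.
    return 2
```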

class SquirrelStats(**kwargs)[source]

Bases: pyrocko.guts.Object

Container to hold statistics about contents available from a Squirrel.

See also Squirrel.get_stats().

nfiles

int

Number of files in selection.

nnuts

int

Number of index nuts in selection.

codes

list of tuple of str objects, default: []

Available code sequences in selection, e.g. (agency, network, station, location) for station nuts.

kinds

list of str objects, default: []

Available content types in selection.

total_size

int

Aggregated file size of files in selection.

counts

dict of dict of int objects, default: {}

Breakdown of how many nuts of any content type and code sequence are available in selection, counts[kind][codes].
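The nested layout counts[kind][codes] can be pictured with a plain dictionary. The sketch below is an illustration under assumptions: build_counts is a hypothetical helper, and the kinds and code tuples are made-up sample values (the station codes follow the (agency, network, station, location) layout mentioned above, the channel codes are invented).

```python
from collections import defaultdict


def build_counts(entries):
    """Count occurrences of every (kind, codes) combination."""
    counts = defaultdict(lambda: defaultdict(int))
    for kind, codes in entries:
        counts[kind][codes] += 1
    # Convert to plain dicts for easier inspection.
    return {kind: dict(by_codes) for kind, by_codes in counts.items()}


# Made-up sample entries, for illustration only.
entries = [
    ('station', ('', 'GE', 'STA1', '')),
    ('station', ('', 'GE', 'STA1', '')),
    ('station', ('', 'GE', 'STA2', '')),
    ('channel', ('', 'GE', 'STA1', '', 'BHZ')),
]

counts = build_counts(entries)
n_sta1 = counts['station'][('', 'GE', 'STA1', '')]
```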

tmin

builtins.float (pyrocko.guts.Timestamp), optional

Earliest start time of all nuts in selection.

tmax

builtins.float (pyrocko.guts.Timestamp), optional

Latest end time of all nuts in selection.

class Squirrel(env=None, database=None, cache_path=None, persistent=None)[source]

Bases: pyrocko.squirrel.base.Selection

Prompt, lazy, indexing, caching, dynamic seismological dataset access.

Parameters:
  • env (SquirrelEnvironment or str) – Squirrel environment instance or directory path to use as starting point for its detection. By default, the current directory is used as starting point. When searching for a usable environment, the directory '.squirrel' or 'squirrel' in the current (or starting point) directory is used if it exists; otherwise the parent directories are searched upwards for such a directory. If no such directory is found, the user’s global Squirrel environment '$HOME/.pyrocko/squirrel' is used.
  • database (Database or str) – Database instance or path to database. By default the database found in the detected Squirrel environment is used.
  • cache_path (str) – Directory path to use for data caching. By default, the 'cache' directory in the detected Squirrel environment is used.
  • persistent (str) – If given a name, create a persistent selection.
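The environment lookup described for env can be sketched with the standard library. This is an assumption-laden illustration of the search order, not the library's implementation: find_environment, its home_dir argument and the demo directory names are hypothetical.

```python
import os
import tempfile


def find_environment(start_dir, home_dir):
    """Search start_dir and its parents for '.squirrel' or 'squirrel'."""
    current = os.path.abspath(start_dir)
    while True:
        for name in ('.squirrel', 'squirrel'):
            candidate = os.path.join(current, name)
            if os.path.isdir(candidate):
                return candidate
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root without finding anything.
            break
        current = parent
    # Fall back to the user's global environment directory.
    return os.path.join(home_dir, '.pyrocko', 'squirrel')


def demo():
    """Build a throw-away directory tree and search it from a subdirectory."""
    with tempfile.TemporaryDirectory() as tmpdir:
        env_dir = os.path.join(tmpdir, 'project', '.squirrel')
        deep_dir = os.path.join(tmpdir, 'project', 'data', '2020')
        os.makedirs(env_dir)
        os.makedirs(deep_dir)
        found = find_environment(deep_dir, home_dir=tmpdir)
        return os.path.relpath(found, tmpdir)


found_relative = demo()
```

Starting two levels below the project directory, the search walks upwards and stops at the project's '.squirrel' directory.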

Provides a unified interface to query seismic waveforms, station and sensor metadata, and event information from local file collections and remote data sources. Query results are promptly returned, even for very large collections, thanks to a highly optimized database setup working behind the scenes. Assemblage of a data selection is very fast for known files as all content indices are cached in a database. Unknown files are automatically indexed when added to the selection.

Features

  • Efficient[1] lookup of data relevant for a selected time window.
  • Metadata caching and indexing.
  • Modified files are re-indexed as needed.
  • SQL database (sqlite) is used behind the scenes.
  • Can handle selections with millions of files.
  • Data can be added and removed at run-time, efficiently[1].
  • Just-in-time download of missing data.
  • Disk-cache of meta-data query results with expiration time.
  • Efficient event catalog synchronization.
  • Always-up-to-date data coverage indices.
  • Always-up-to-date indices of available station/channel codes.

[1] O(log N) performance, where N is the number of data entities (nuts).

Queries are restricted to the contents offered by the files which have been added to the Squirrel (which usually is a subset of the information collected in the attached global file meta-information database).

By default, temporary tables are created in the attached database to hold the names of the files in the selection as well as various indices and counters. These tables are only visible inside the application which created them. If a name is given to persistent, a named selection is created, which is visible also in other applications using the same database.

Paths of files can be added to the selection using the add() method.

add(paths, kinds=None, format='detect', check=True, progress_viewer='terminal')[source]

Add files to the selection.

Parameters:
  • paths (list of str) – Iterator yielding paths to files or directories to be added to the selection. Recurses into directories. If given a str, it is treated as a single path to be added.
  • kinds (list of str) – Content types to be made available through the Squirrel selection. By default, all known content types are accepted.
  • format (str) – File format identifier or 'detect' to enable auto-detection.

Complexity: O(log N)

reload()[source]

Check for modifications and reindex modified files.

Based on file modification times.

add_virtual(nuts, virtual_paths=None)[source]

Add content which is not backed by files.

Parameters:
  • nuts (iterator yielding Nut objects) – Content pieces to be added.
  • virtual_paths (list of str) – List of virtual paths to prevent creating a temporary list of the nuts while aggregating the file paths for the selection.

Stores to the main database and the selection.

add_source(source)[source]

Add remote resource.

Parameters: source (subclass of Source) – Remote data access client instance.

add_fdsn(*args, **kwargs)[source]

Add FDSN site for transparent remote data access.

Arguments are passed to FDSNSource.

add_catalog(*args, **kwargs)[source]

Add online catalog for transparent event data access.

Arguments are passed to CatalogSource.

iter_nuts(kind=None, tmin=None, tmax=None, codes=None, naiv=False, kind_codes_ids=None)[source]

Iterate content entities matching given constraints.

Parameters:
  • kind (str, list of str) – Content kind (or kinds) to extract.
  • tmin (timestamp) – Start time of query interval.
  • tmax (timestamp) – End time of query interval.
  • codes (tuple of str) – Pattern of content codes to be matched.
  • naiv (bool) – Bypass time span lookup through indices (slow, for testing).
  • kind_codes_ids (list of str) – Kind-codes IDs of contents to be retrieved (internal use).

Complexity: O(log N) for the time selection part due to heavy use of database indices.

Yields Nut objects representing the intersecting content.

Query time span is treated as a half-open interval [tmin, tmax). However, if tmin equals tmax, the edge logic is switched to closed-interval so that content intersecting with the time instant t = tmin = tmax is returned (otherwise nothing would be returned, as [t, t) never matches anything).

Time spans of content entities to be matched are also treated as half-open intervals, e.g. content span [0, 1) is matched by query span [0, 1) but not by [-1, 0) or [1, 2). Here too, the logic is switched to closed-interval when the content time span is an empty interval, i.e. when it indicates a time instant. E.g. time instant 0 is matched by [0, 1) but not by [-1, 0) or [1, 2).
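The matching rules can be condensed into a small predicate. The following is a sketch of one possible reading, not the query actually issued to the database: spans_match is a hypothetical name, and the handling of an instant query against non-empty content (t inside [content_tmin, content_tmax)) is an assumption, since the text does not spell out that edge.

```python
def spans_match(query_tmin, query_tmax, content_tmin, content_tmax):
    """Decide whether a content time span is matched by a query time span."""
    query_is_instant = query_tmin == query_tmax
    content_is_instant = content_tmin == content_tmax

    if query_is_instant and content_is_instant:
        # Two time instants match only if they coincide.
        return query_tmin == content_tmin
    if content_is_instant:
        # Time instant inside half-open query span [tmin, tmax).
        return query_tmin <= content_tmin < query_tmax
    if query_is_instant:
        # Assumption: instant t matches content span [tmin, tmax).
        return content_tmin <= query_tmin < content_tmax
    # Two non-empty half-open intervals intersect.
    return query_tmin < content_tmax and content_tmin < query_tmax
```

The examples given in the text serve as a quick check: content span [0, 1) is matched by query [0, 1) but not by [-1, 0) or [1, 2), and the same holds for the time instant 0.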

get_nuts(*args, **kwargs)[source]

Get content entities matching given constraints.

Like iter_nuts() but returns results as a list.

get_time_span(kinds=None)[source]

Get time interval over all content in selection.

Complexity: O(1), independent of the number of nuts.

Returns: (tmin, tmax)

get_deltat_span(kind)[source]

Get min and max sampling interval of all content of given kind.

Parameters: kind (str) – Content kind

Returns: (deltat_min, deltat_max)

iter_kinds(codes=None)[source]

Iterate over content types available in selection.

Parameters: codes – if given, get kinds only for selected codes identifier

Complexity: O(1), independent of number of nuts

iter_deltats(kind=None)[source]

Iterate over sampling intervals available in selection.

Parameters: kind (str) – if given, get sampling intervals only for a given content type

Complexity: O(1), independent of number of nuts

iter_codes(kind=None)[source]

Iterate over content identifier code sequences available in selection.

Parameters: kind (str) – if given, get codes only for a given content type

Complexity: O(1), independent of number of nuts

iter_counts(kind=None)[source]

Iterate over number of occurrences of any (kind, codes) combination.

Parameters: kind – if given, get counts only for selected content type

Yields tuples ((kind, codes), count)

Complexity: O(1), independent of number of nuts

get_kinds(codes=None)[source]

Get content types available in selection.

Parameters: codes – if given, get kinds only for selected codes identifier

Complexity: O(1), independent of number of nuts

Returns: sorted list of available content types

get_deltats(kind=None)[source]

Get sampling intervals available in selection.

Parameters: kind – if given, get sampling intervals only for selected content type

Complexity: O(1), independent of number of nuts

Returns: sorted list of available sampling intervals

get_codes(kind=None)[source]

Get identifier code sequences available in selection.

Parameters: kind – if given, get codes only for selected content type

Complexity: O(1), independent of number of nuts

Returns: sorted list of available codes as tuples of strings

get_counts(kind=None)[source]

Get number of occurrences of any (kind, codes) combination.

Parameters: kind – if given, get counts only for selected content type

Complexity: O(1), independent of number of nuts

Returns: dict with counts[kind][codes] or counts[codes] if kind is not None

update(constraint=None, **kwargs)[source]

Update inventory of remote content for a given selection.

This function triggers all attached remote sources, to check for updates in the metadata. The sources will only submit queries when their expiration date has passed, or if the selection spans into previously unseen times or areas.

get_nfiles()[source]

Get number of files in selection.

get_nnuts()[source]

Get number of nuts in selection.

get_total_size()[source]

Get aggregated file size available in selection.

get_stats()[source]

Get statistics on contents available through this selection.

get_content(nut, cache='default', accessor='default')[source]

Get and possibly load full content for a given index entry from file.

Loads the actual content objects (channel, station, waveform, …) from file. For efficiency sibling content (all stuff in the same file segment) will also be loaded as a side effect. The loaded contents are cached in the Squirrel object.

chopper_waveforms(obj=None, tmin=None, tmax=None, time=None, codes=None, tinc=None, tpad=0.0, want_incomplete=True, degap=True, maxgap=5, maxlap=None, snap=(round, round), include_last=False, load_data=True, accessor_id=None, keep_current_files_open=False, **kwargs)[source]

Iterate window-wise over waveform data.

Parameters:
  • tmin – start time (default uses start time of available data)
  • tmax – end time (default uses end time of available data)
  • tinc – time increment (window shift time) (default uses tmax-tmin)
  • tpad – padding time appended on either side of the data windows (window overlap is 2*tpad)
  • trace_selector – filter callback taking pyrocko.trace.Trace objects
  • want_incomplete – if set to False, gappy/incomplete traces are discarded from the results
  • degap – whether to try to connect traces and to remove gaps and overlaps
  • maxgap – maximum gap size in samples which is filled with interpolated samples when degap is True
  • maxlap – maximum overlap size in samples which is removed when degap is True
  • keep_current_files_open – whether to keep cached trace data in memory after the iterator has ended
  • accessor_id – used as a key to identify different points of extraction for the decision of when to release cached trace data (should be used when data is alternately extracted from more than one region / selection)
  • snap – replaces Python’s round() function which is used to determine indices where to start and end the trace data array
  • include_last – whether to include the very last sample
  • load_data – whether to load the waveform data. If set to False, traces with no data samples, but with correct meta-information are returned
Returns:

iterator yielding a list of pyrocko.trace.Trace objects for every extracted time window
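How tmin, tmax, tinc and tpad interact can be sketched without any waveform data. This is a rough illustration under assumptions, not the method's actual windowing code: iter_windows is a hypothetical helper, and clipping the last window at tmax is an assumption.

```python
def iter_windows(tmin, tmax, tinc=None, tpad=0.0):
    """Yield (window_tmin, window_tmax) pairs, padded by tpad on both sides."""
    if tinc is None:
        # Default: one single window spanning the whole time range.
        tinc = tmax - tmin
    if tinc <= 0:
        raise ValueError('tinc must be positive')

    iwin = 0
    while True:
        wmin = tmin + iwin * tinc
        if wmin >= tmax:
            break
        # Assumption: the last window is clipped at tmax.
        wmax = min(wmin + tinc, tmax)
        yield (wmin - tpad, wmax + tpad)
        iwin += 1


windows = list(iter_windows(0.0, 10.0, tinc=4.0, tpad=1.0))
```

With these values, windows is [(-1.0, 5.0), (3.0, 9.0), (7.0, 11.0)]: consecutive windows overlap by 2*tpad = 2.0 seconds.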

get_coverage(kind, tmin=None, tmax=None, codes_list=None, limit=None)[source]

Get coverage information.

Get information about strips of gapless data coverage.

Parameters:
  • kind – Content kind to be queried.
  • tmin – Start time of query interval.
  • tmax – End time of query interval.
  • codes_list – List of code patterns to query. If not given or empty, an empty list is returned.
  • limit – Limit query to return only up to a given maximum number of entries per matching channel (without setting this option, very gappy data could cause the query to execute for a very long time).
Returns:

list of entries of the form (pattern, codes, deltat, tmin, tmax, data) where pattern is the request pattern which yielded this entry, codes are the matching channel codes, tmin and tmax are the global min and max times for which data for this channel is available, regardless of any time restrictions in the query. data is another list with (up to limit) checkpoints of the form (time, count) where a count of zero indicates a data gap, a value of 1 normal data coverage and higher values indicate duplicate/redundant data.
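The (time, count) checkpoints in data can be understood as change points of a coverage counter. The sketch below derives such checkpoints from a list of half-open time spans; it is an illustration of the representation only, and coverage_checkpoints is a hypothetical helper, not part of the library.

```python
def coverage_checkpoints(spans):
    """
    Turn half-open time spans [tmin, tmax) into (time, count) checkpoints.

    count is the number of spans covering the time range starting at the
    checkpoint: 0 marks a gap, 1 normal coverage, >1 redundant data.
    """
    changes = {}
    for tmin, tmax in spans:
        changes[tmin] = changes.get(tmin, 0) + 1
        changes[tmax] = changes.get(tmax, 0) - 1

    checkpoints = []
    count = 0
    for time in sorted(changes):
        if changes[time] == 0:
            # One span ends exactly where another starts: no change.
            continue
        count += changes[time]
        checkpoints.append((time, count))
    return checkpoints


# Two contiguous spans, a gap, then two overlapping spans.
checkpoints = coverage_checkpoints([(0, 10), (10, 20), (30, 40), (35, 45)])
```

For this input, checkpoints is [(0, 1), (20, 0), (30, 1), (35, 2), (40, 1), (45, 0)].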

print_tables(table_names=None, stream=None)[source]

Dump raw database tables in textual form (for debugging purposes).

Parameters:
  • table_names (list of str) – Names of tables to be dumped or None to dump all.
  • stream – Open file or None to dump to standard output.

class DatabaseStats(**kwargs)[source]

Bases: pyrocko.guts.Object

Container to hold statistics about contents cached in meta-information db.

nfiles

int

number of files in database

nnuts

int

number of index nuts in database

codes

list of tuple of str objects, default: []

available code sequences in database, e.g. (agency, network, station, location) for station nuts.

kinds

list of str objects, default: []

available content types in database

total_size

int

aggregated file size of files referenced in database

counts

dict of dict of int objects, default: {}

breakdown of how many nuts of any content type and code sequence are available in database, counts[kind][codes]

class Database(database_path=':memory:', log_statements=False)[source]

Bases: object

Shared meta-information database used by Squirrel.

dig(nuts)[source]

Store or update content meta-information.

Given nuts are assumed to represent an up-to-date and complete inventory of a set of files. Any old information about these files is first pruned from the database (via database triggers). If such content is part of a live selection, it is also removed there. Then the new content meta-information is inserted into the main database. The content is not automatically inserted into the live selections again. It is the responsibility of the selection object to perform this step.
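The pruning via database triggers can be demonstrated with a toy sqlite3 schema. This is a simplified sketch, not Squirrel's real schema: the tables nuts and selection_nuts, the trigger prune_selection and the sample file name are all invented for illustration.

```python
import sqlite3


def demo_trigger_pruning():
    """Replace the inventory of one file and watch the selection being pruned."""
    con = sqlite3.connect(':memory:')
    # Hypothetical, heavily simplified schema.
    con.executescript('''
        CREATE TABLE nuts (
            nut_id INTEGER PRIMARY KEY,
            path TEXT,
            kind TEXT);
        CREATE TABLE selection_nuts (
            nut_id INTEGER PRIMARY KEY);
        CREATE TRIGGER prune_selection BEFORE DELETE ON nuts
        BEGIN
            DELETE FROM selection_nuts WHERE nut_id = old.nut_id;
        END;
    ''')

    # Old inventory of the file, also referenced by a live selection.
    con.executemany(
        'INSERT INTO nuts (path, kind) VALUES (?, ?)',
        [('a.mseed', 'waveform'), ('a.mseed', 'waveform')])
    con.execute('INSERT INTO selection_nuts SELECT nut_id FROM nuts')

    # Store updated inventory: prune old entries, then insert new ones.
    con.execute('DELETE FROM nuts WHERE path = ?', ('a.mseed',))
    con.executemany(
        'INSERT INTO nuts (path, kind) VALUES (?, ?)',
        [('a.mseed', 'waveform')] * 3)

    n_nuts = con.execute('SELECT COUNT(*) FROM nuts').fetchone()[0]
    n_selected = con.execute(
        'SELECT COUNT(*) FROM selection_nuts').fetchone()[0]
    con.close()
    return n_nuts, n_selected


n_nuts, n_selected = demo_trigger_pruning()
```

After the update, the main table holds the three new entries while the selection table is empty, mirroring the statement that re-inserting content into live selections is left to the selection object.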

remove(path)[source]

Prune content meta-information about a given file.

All content pieces belonging to file path are removed from the main database and any attached live selections (via database triggers).

reset(path)[source]

Prune information associated with a given file, but keep the file path.

This method is called when reading a file failed. File attributes, format, size and modification time are set to NULL. File content meta-information is removed from the database and any attached live selections (via database triggers).

silent_touch(path)[source]

Update modification time of file without initiating reindexing.

Useful to prolong validity period of data with expiration date.