IRIS manages a large amount of seismic data. To help users access those data, it is constructing a database of metrics — values derived from the raw data. The project has three main components: a coordinating process; a system that retrieves data and calculates metrics; and a store that saves metrics and provides a web interface for their retrieval. I was responsible for the design and implementation (in Java, with Spring and PostgreSQL) of the storage component.
The main requirements for the storage component were:
A scalable, performant database.
Interactive data exploration.
Data upload and download.
Flexibility to accommodate future metrics.
The data model has three basic types: metrics, targets and measurements. Metrics describe the structure and meaning of measurements (values over intervals) for targets. A simple metric might be “signal RMS”: its measurements would be the RMS value together with the start and end times of the sample, and its targets would be channels (more precisely, the data streams from the channels).
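The three basic types can be sketched as plain Java types. This is an illustrative sketch only; the names (`Metric`, `Target`, `Measurement` and their fields) are assumptions, not the project's actual API.

```java
import java.time.Instant;

// A metric describes the structure and meaning of its measurements (illustrative).
interface Metric {
    String name();          // e.g. "signal RMS"
    String description();
}

// A target identified by its SNCL-style components (network, station, location, channel).
record Target(String network, String station, String location, String channel) { }

// A measurement: a value over an interval, for one target.
record Measurement(Target target, Instant start, Instant end, double value) { }

public class DataModelSketch {
    public static void main(String[] args) {
        Target channel = new Target("IU", "ANMO", "00", "BHZ");
        Measurement m = new Measurement(channel,
                Instant.parse("2023-01-01T00:00:00Z"),
                Instant.parse("2023-01-02T00:00:00Z"),
                0.42);  // the RMS value over the interval
        System.out.println(m.target().station() + " " + m.value());
    }
}
```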
After discussion with the client we decided to avoid any cross-metric logic in the query interface. This decouples metrics in the schema, which allows application-level sharding: each metric can be placed in a separate database (note: the design enables sharding, but the current implementation uses a single database connection for all metrics).
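Because no query spans metrics, routing can be decided per metric name. A minimal sketch of what that routing might look like, assuming a simple metric-to-URL map (the class and the URLs are hypothetical; in the current deployment everything falls through to the single default connection):

```java
import java.util.Map;

// Hypothetical router: each metric may live in its own database.
public class MetricRouter {
    private final Map<String, String> shards;  // metric name -> JDBC URL
    private final String defaultUrl;           // used when a metric has no shard

    public MetricRouter(Map<String, String> shards, String defaultUrl) {
        this.shards = shards;
        this.defaultUrl = defaultUrl;
    }

    public String urlFor(String metric) {
        return shards.getOrDefault(metric, defaultUrl);
    }

    public static void main(String[] args) {
        MetricRouter router = new MetricRouter(
                Map.of("spectra", "jdbc:postgresql://shard2/metrics"),
                "jdbc:postgresql://main/metrics");
        System.out.println(router.urlFor("signal_rms"));  // no shard: default URL
    }
}
```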
We still need to handle targets, which are common across metrics, but analysis showed that the number of distinct targets is small enough for them to be cached in memory and supplied to queries as PostgreSQL arrays (minimising data volume).
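The cache might look something like the following sketch: targets are resolved to integer ids in memory, and the resulting id array is what would be bound to a single PostgreSQL array parameter (e.g. a `WHERE target = ANY(?)` clause) in the query. The class and method names are assumptions for illustration.

```java
import java.util.*;

// Hypothetical in-memory target cache: SNCL code -> database id.
public class TargetCache {
    private final Map<String, Integer> idsBySncl = new HashMap<>();

    public void put(String sncl, int id) {
        idsBySncl.put(sncl, id);
    }

    // Resolve selected SNCL codes to the id array supplied to the database;
    // unknown codes are simply dropped.
    public Integer[] idsFor(Collection<String> sncls) {
        return sncls.stream()
                .map(idsBySncl::get)
                .filter(Objects::nonNull)
                .toArray(Integer[]::new);
    }

    public static void main(String[] args) {
        TargetCache cache = new TargetCache();
        cache.put("IU.ANMO.00.BHZ", 1);
        System.out.println(cache.idsFor(List.of("IU.ANMO.00.BHZ")).length);
    }
}
```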
At the Java level, all data loading and retrieval is generic, parameterised by the metric interface. Metric implementations are loaded from the classpath using the Java service provider API, and when the server starts, new metrics can create suitable tables in the schema. So adding a new metric requires only the deployment to the server of a jar containing a new metric subclass, typically a handful of lines of code.
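The service provider mechanism mentioned here is `java.util.ServiceLoader`: each deployed jar declares its implementations in a `META-INF/services` file, and the server discovers them at start-up. A sketch, with an illustrative interface and a complete (hypothetical) "handful of lines" metric:

```java
import java.util.ServiceLoader;

// Illustrative metric interface; the real one is assumed, not reproduced.
interface Metric {
    String name();
    String createTableSql();  // DDL run at server start-up
}

// A complete new metric: roughly what a deployed jar would add.
class SignalRms implements Metric {
    public String name() { return "signal_rms"; }
    public String createTableSql() {
        return "CREATE TABLE IF NOT EXISTS signal_rms ("
             + "target INT, start_time TIMESTAMPTZ, "
             + "end_time TIMESTAMPTZ, value DOUBLE PRECISION)";
    }
}

public class MetricLoaderSketch {
    public static void main(String[] args) {
        // Providers are declared in META-INF/services/Metric inside each jar;
        // here none are registered, so the loop simply finds nothing.
        for (Metric m : ServiceLoader.load(Metric.class)) {
            System.out.println("installing " + m.name());
        }
    }
}
```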
As proof of concept, the final deliverable includes, in addition to the metrics currently available (which tend to be simple float values), a metric whose measurements are distributions (spectra, stored in a second table and retrieved via a join).
Targets can be complex: a network, site, location, channel, or a ratio between these, and the relationships between targets are hierarchical. This could have led to an extremely complex data model, but working with the client and using existing standards (SNCL codes) gave a very elegant encoding with a flexible interface.
For example, targets can be selected by matching components (station and network, for example). In extreme cases, a list of targets can be supplied; the format matches that returned from target searches, so a user can do an initial search for targets and then hand-tune the list (in an editor or with a script) before querying for measurements.
This is just one example of how a cooperative, responsive client, aware of technical issues, was able to work with us to achieve the best possible design.
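The component matching described above can be sketched as a wildcard match over dotted SNCL codes. This is a hypothetical illustration of the idea, not the project's selection syntax:

```java
// Hypothetical component matching: a selector fixes some SNCL components
// and wildcards the rest (e.g. all channels of one station).
public class SnclMatch {
    // Both selector and code are "NET.STA.LOC.CHA"; "*" in the selector
    // matches any value for that component.
    public static boolean matches(String selector, String code) {
        String[] want = selector.split("\\.", -1);
        String[] have = code.split("\\.", -1);
        if (want.length != have.length) return false;
        for (int i = 0; i < want.length; i++) {
            if (!want[i].equals("*") && !want[i].equals(have[i])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Select every channel at station ANMO in network IU.
        System.out.println(matches("IU.ANMO.*.*", "IU.ANMO.00.BHZ"));
    }
}
```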
The only significant problem to surface during system validation was uncertainty about data validation. For efficiency, data are streamed to the database in a single transaction, which makes detailed error reporting and handling difficult. The final, pragmatic solution was to use IEEE double storage for most values, which avoids range errors (the most common kind) by mirroring the representation used in the calculations.
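The reasoning behind double storage can be seen in a small example: a value that is perfectly representable as an IEEE double overflows a 32-bit float, so storing in a narrower column (e.g. PostgreSQL `REAL` rather than `DOUBLE PRECISION`) could reject or corrupt values that the calculations produced without error. The specific value below is illustrative.

```java
public class RangeSketch {
    public static void main(String[] args) {
        double computed = 1e200;            // well within IEEE double range
        float narrowed = (float) computed;  // overflows 32-bit float to infinity

        System.out.println(Float.isInfinite(narrowed));  // true: range error
        System.out.println(Double.isFinite(computed));   // true: stored exactly
    }
}
```

Mirroring the calculation's representation in storage means the only values that could fail a range check are ones the calculation itself could not have produced.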