Specializing Databases for Different Tasks

The database server has to maintain a lot of different types of information with different access patterns. For example, the database knows that DELL stands for DELL Inc., that it trades on the NASDAQ Mon-Fri from 9:00am to 4:00 pm (ET), and that on 5/8/96 it closed at 1.4062. To support these different types of data and to optimize for the different expected usage, the MIM server manages four different databases. The use of these different databases is totally transparent to the user. In fact, it is transparent to many parts of the system itself.

The following details the four different databases used. The first two databases manage all the information needed for each ticker symbol. The second two manage the time series data.

In-memory (Memory-mapped) Schema Database

Extensional Database

Time Series Database

Current Tick Database

Ticker Symbol Databases

There is a lot of information associated with each symbol in the database (name, description, exchange, trading time, etc.). All this information is stored either in the in-memory schema database or in the extensional database. Some of the information is necessary for the system to be operational in a minimal form and is needed all the time:

  • Name (Ticker Symbol) - used for parsing queries, displaying menus, etc.

  • Inheritance Tree - because of searches

Since these items are used so frequently, the information is stored in-memory. Auxiliary information where the access need is much less critical/frequent is not kept in-memory but is stored on disk instead. This auxiliary information is kept in the extensional database.

There is a balance between what needs to be stored in-memory for efficiency versus what should be stored on disk. Currently, we keep a lot of information in-memory, using memory mapped files to minimize the impact on swap space requirements.

Use of the terms schema and extensional are historical. The primary purpose of these databases is to store information about the time series data so the term schema was appropriate initially. Subsequently, the extensional database was viewed as an extension to the schema. However, currently we store hundreds of thousands of symbols in the database with a lot of information associated with each one so these really are databases in and of themselves.

The schema is an object-oriented database supporting inheritance. It manages a set of relations, corresponding to ticker symbols (DELL); columns, corresponding to attributes of relations (Close); and relation columns, corresponding to the historical data for a ticker symbol (Close of DELL). The extensional database, on the other hand, is relational in nature, utilizing an optimized B-tree variant to effect very fast disk retrieval.

The relations and columns are organized in a hierarchical tree structure which is used to inherit both attributes of tickers and values for these attributes. For example, when the ticker CPI is added to Economic Indicators it inherits a sparse column. Similarly, when SP is added to Futures, it inherits price columns as well as Volume and Open Interest. Other inherited properties are the data type of the relation columns (e.g., Volume of DELL is an integer), and the exchange where it trades, etc. Some of the information is inherited via multiple paths.

Trading times and trading patterns are not inherited.

Multiple relation and column hierarchies can be defined by using aliases. This allows, for example, symbols to be organized by region as well as by industry. As many paths to a given symbol as desired can be defined. Symbols can also be located by using the custom search facility. For example, a custom search can return all equities in a particular industry or index. Another custom search could locate a spread between two currencies.

There are many other objects and types. Some examples include:

  • Description - String providing user-defined explanation of a relation.

  • Data Source - User defined data for a relation column.

  • Rollover Policy - Sophisticated language for specification of rollover switching date and policy (e.g., rollover on second Friday of expiration month using a 3-month logarithmic perpetual policy). Rollover information is associated with relations and stored in the database. Continuous contracts are generated automatically and the system updates these automatically.

  • Trading Patterns - Non-standard trading ranges, including weekend days and non-contiguous trading days/times (e.g., Mon, Wed, Fri, Sat.; 8am-2pm, 3pm-5pm).

  • Segments - Time-varying treatment of trading periods (e.g., Mon-Sat until 1960 and Mon-Fri after that).

  • Aggregation Rules - Rules for combining data into desired frequency (e.g., averaging, sum, last value, first value, high, low).

  • Sparse Data - Optimized storage and handling for this data type since it is conceptually a different object: the last value is applicable until a new one comes out (e.g., CPI report).

  • Exchanges - Where a relation trades.

  • Time Zones - Normalization factor for when a relation trades.

  • Holiday Schedules - Alternative holiday schedules (e.g., foreign) available on a per-relation or per-exchange basis. System differentiates between holidays and missing data and allows different filling specifications (NaN, fill-forward, fill-backward, linear/geometric/logarithmic interpolation).

  • Entitlements - Read/write access privileges to data defined for specific users or groups of users (e.g., who has access to IBES data or energy futures data or even down to the granularity of a specific column of a specific relation, e.g., Cash of CL).

  • Corrections History - Old values of the data maintained for risk evaluation.

  • Reaper Tables - Management facility for controlling the growth of data in the system. Tables are used to store user-defined horizons beyond which the data should not be stored. Particularly useful for current tick data.

  • User-defined - Storage for other information defined by the user (e.g., website, address of company).

Time Series Databases

Conceptually, a time series is a function that takes a date/time and returns a value. This simple concept is complicated in two important ways. First, a time series need not have a value for a particular date and time. In fact, it should behave as if certain dates and times do not exist, e.g., weekends. Secondly, a time series may have more than one value for a specific date and time, e.g., data coming off a real-time feed. When these complications are factored in, a general time series resembles a relation instead of a function or even a partial function. From an implementation viewpoint, the simple conceptual model yields a much more tractable solution. This leaves the designer with a difficult choice.

  • Ignore the extremes. Implement only the more common classes (daily, M-F) and ignore the rest. This approach is easy to code and gives order (constant speed) access time. This is the approach taken by most of our competitors.

  • Implement the worst case. Treat all time series uniformly to achieve full-blown generality. This approach is very flexible but results in slower (order log(n)) access for all time series because there is no way to exploit the regularity of the data. This is the relational approach.

  • Implement special cases. Program special cases by writing lots of classes to exploit the regularity of the data, choosing the appropriate implementation for a specific time series at run-time. In some cases the performance will be as good as the best case and in others, it will be comparable to the general case. The actual performance depends on the specific time series. Probabilistically, the performance will be quite good since, by assumption, the more common cases can be accessed in constant time. This approach is both fast and flexible but difficult to program.

The MIM database implementation takes the third approach. The implementation breaks time series into two orthogonal halves, a time component and a storage component, each with many special cases.

  • Time component: Calendar Abstraction

  • Compresses the date/time space into a more rational sequence: converts date/time into indices by compressing holes.

  • Calendar is the key of time series so important to do a good job of it.

  • Perfect place to exploit regularity due to all the variety: easy calendar (M-F) is so common and nice that get large bang for buck by special casing.

  • Calendar handles: different frequencies: n seconds, minutes, hours, days, weeks, months, quarters, years. - date arithmetic (e.g., 10 days ago) - contiguous/noncontiguous trading days and times - multi-segment (time-varying trading patterns)

  • Conceptually since days are converted to indices, all that’s left is an array lookup. This is true in some cases but not in the more complicated ones.

  • Storage component

  • Gets location and figures out a value.

  • Handles: - multi-value data (e.g., real tick where multiple values per time stamp are stored) - multi-field data (multiple fields per value) - sparse/non-sparse data - float, integer, double data types for single-field - gross granularity (tick storage different than daily)

  • When retrieving, time series sometimes kept in memory and sometimes not (daily fits but tick does not) so some read in all and process whereas others read in pages at a time.

  • Depending on how much data involved and type will have arrays, hashes, nested arrays, B-trees, etc. Some time series are as simple as arrays; some are disk-based objects that are trees so lookup involves traversing a tree with all the complications of virtual pager/paging policies, caching in and out, modification of buffers, functions to find the right bucket, etc.

  • Options time series have their own special storage class because they have strike prices, calls/puts, expiration dates to handle. - Could use regular storage class to store options but replicating information if do this so we use a special storage class to renormalize and avoid replication. - Options are inherently multi-valued at the physical layer (in one date/time slot can have more than one option trading).

  • Compression of data is an interesting issue because one cannot just simply take time series history, compress it and store it on a disk. - Some history is very large and one would not want to compress/uncompress it every time there is an update. - We break it up into chunks and access via discrimination nets. Buckets are compressed by using the tree to walk down the hierarchy until a chunk of the right size is obtained. To read, the appropriate bucket is found, read into memory and uncompressed. Block sizes are tunable in the system. - We compress tick data but not daily. This has empirically been determined to be the correct trade-off. It is a big win to compress tick data due to the magnitude of data: the savings in I/O time more than makes up for the overhead of compression. But, with daily data, you lose in terms of speed with no appreciable gain so compression is not used. - We use a standard compression algorithm which has proven to be quite acceptable. As an example, compression was turned off for an update and the database grew to 350MB vs. 80MB when compression was employed. - Compression can be specified on a per-time series basis and thresholds set for the data size, below which compression will not be used.

By splitting up time series into these two components, we achieve a great deal of code reuse. Rather than special case all possible combinations of calendar and storage classes, we need only implement each calendar and storage class and then combine the two. We can achieve even more code reuse by using inheritance (both is-a and has-a) and templates to implement the calendar as well as the storage classes. For example, a has-a hierarchy is a natural way to handle multi-segment calendars, with care taken around segment boundaries. This benefits not only the coding, but also the execution, since one or more of the segments may be special cased (e.g., M-F).

The net result is a very complex virtual C++ class hierarchy implementing time series. However, once an object is created, interactions with it are simple since the class hierarchy implements a unified interface. Moreover, once a public virtual call is made, nested calls are non-virtual; hence, the performance penalty of using virtual classes is minimized. The complex task of instantiating a specific time series object is delegated to a time series factory which, among other things, caches the more frequently used time series. The factory also interacts with the current tick database to ensure that the newly-created time series has the most recent values.

Consider a simple time series example:

  • Single-valued, float

  • Daily

  • Mon-Fri

This time series could represent the closing price of DELL. Since the trading pattern is so regular, a constant order calendar can be used. Moreover, a 10 year history of this time series requires only 2500 values so an array is the appropriate storage choice.

In contrast, consider the following worst case example:

  • multi-value data

  • multi-field non-aligned (e.g., float, char(1), int – not mod 4 address so can’t use integer to integer copy in the system to get the value)

  • multi-segment (trades Mon-Sat and then switches to Mon-Fri)

  • noncontiguous in both dimensions (daily: Mon, Wed, Thur, Fri, Sat; tick: 8-2, 3-5)

There is nothing to exploit with this time series. Hence, both a general calendar and storage class are required.

To improve performance, the system relies on caching. Most time series processing accesses the time series elements either sequentially or in sequential windows, hence caching can result in significant speed-ups. For example, the calendar class caches the “current” segment in a multi-segment time series. The storage class caches the location of a value in a multi-valued time series, which avoids doing an order log (n) search for nearby values.

In other cases, where the time series has some but not all elements of the simple case, we can still exploit whatever regularity exists. For example, if a time series is single-valued, float, with a single non-contiguous segment, we can use a simple storage class with a non-contiguous, single-segment calendar class. We pay the penalty for using a non-contiguous trading period, but we can still optimize the remainder of the time series, resulting in constant-access time.

The time series class is designed to provide fast access, and does so at the expense of slower update speed. To bridge the gap with real-time feeds, where update speed is the most critical factor, the MIM server uses a separate database specialized for extremely fast update (approximately 1000 transactions per second.) This database is used by the time series factory to fetch the latest values when a time series is created. Thus, the users of the time series class are unaware of the current tick database’s existence.

The data stored in the current tick database can be permanently merged with the historical time series data. This facilitates a way to do updates during the day (either intraday tick or real tick) and, at the end of the day either migrate the current tick data into the historical data or discard it in favor of an update that has been processed and “cleaned”. Folding into the historical data can be done selectively and the data can be folded into daily, intraday or real tick with aggregation done automatically by the system as appropriate.

Putting It All Together

To illustrate how all the system’s databases inter-operate, consider a simple XMIM query:

   SHOW 
     1: BAR of DELL 
   WHEN 
       Date is within 100 days

First of all, the system will use the in-memory schema to look up the symbol DELL and decide to use the DELL calendar to execute the query (since it is the only symbol present in the query). The choice of calendar gives meaning to the phrase 100 days. Then the in-memory schema and extensional database are accessed to get useful information on DELL such as:

  • Precision to be displayed in.

  • Holiday schedule and NaN filling rules.

  • Whether the data is sparse or non-sparse.

  • The system will then go to disk and read the time series from storage.

  • Then the current tick database will be accessed to find if there are any recent prices for DELL that need to be included.

  • Caching is involved such that if any of the Bar prices for DELL are asked for again, the same thing will be returned, i.e., the most recent prices of DELL affect all four components of the Bar, but it would be a gross mistake to consider a more recent price when computing the Close component of DELL than when computing its High.

  • The execution code actually does not know of the four different databases, as far as it is concerned, everything is in memory.

  • The execution engine has all it needs:

    • Time series objects - can ask for value in this range

    • Calendar object - figure out what date is within 100 days

After all the information is accessed then it is just a matter of report generation.