Data Versioning

Data Versioning, Temporal Data, Transaction Time Validity. Different names for maintaining snapshots of data that are consistent across multiple tables at any point in time. Just like Time Machine, but for data. 

This is a description of how it can be implemented with MySQL. 

The basic technique with the use of views is described and illustrated here (under ”The better plan”), we’ve just improved it a bit.

First of all a data table to store information about people. Every time a record is updated, a new record is created. 

The table will hold all historic versions of each people record, including the most current one.

CREATE TABLE people_data (

id_version BIGINT UNSIGNED,

Firstname VARCHAR(255),

Lastname VARCHAR(255),

moddate TIMESTAMP NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,

KEY id_version(id_version)

);

Then we have a table to keep track of versions and hold some meta data about each version. This table is common for all data tables that we want to enable versioning for. This is a variant of the Audit table on http://www.codeproject.com/KB/architecture/LisRecordVersioning.aspx but we use timestamp_start and timestamp_end instead of IsActive, to be able to get the historic state across all data tables of any point in time (this is picked up from http://www.paragon-cs.com/mag/issue7.pdf ).

CREATE TABLE version (

id BIGINT UNSIGNED auto_increment,

guid VARCHAR(20),

changedby_start VARCHAR(100),

changedby_end VARCHAR(100),

dt_start DATETIME,

dt_end DATETIME DEFAULT ’3000-01-01 00:00:00’,

PRIMARY KEY (id),

KEY guid(guid),

KEY dt_start (dt_start),

KEY dt_end (dt_end));

Add a trigger to supply a default value for the timestamp_start since DATETIME can’t take default dynamic value

/* Use trigger to keep track of creation date (http://bugs.mysql.com/bug.php?id=27645) */

CREATE TRIGGER event_insert BEFORE INSERT ON version

   FOR EACH ROW SET NEW.dt_start = NOW();

We need to use a function to be able to use a dynamic condition in the VIEW. This is a workaround for a limitation in MySQL that prevents variables to be used in the definition for a VIEW.

”The SELECT statement cannot refer to system or user variables.” http://dev.mysql.com/doc/refman/5.0/en/create-view.html

Comment: ”A simple workaround for MySQL’s limitation on local variables usage in views is to use a function, which returns variable’s value”

CREATE FUNCTION history()

RETURNS VARCHAR(20)

RETURN IF(@history > 0, @history, ”2999-12-31 00:00:00”);

Finally a view for the data table and a supporting function that will give us the most recent version of every data record.

CREATE VIEW people AS

SELECT people_data.*, version.guid FROM people_data

JOIN version ON people_data.id_version=version.id

WHERE dt_start <= history() AND dt_end > history();

 

This is how it works

Every version is stored in people_data, and the people VIEW is used to pull out the most recent version based on the version metadata.

Most importantly, nothing is ever deleted or updated in people_data. It just appears to be deleted when looking in the people VIEW.

Nothing is ever deleted from the version table either. The version table is the global log of everything that has happened. It might be feasible to store more metadata there, such as what data table the revision affected, what kind of ”logical” operation happened (insert, update, delete).

The version table is only inserted into, except when updating the end date to pass the ”current record” status on to the next version.

For every data table we want versioned we add the data table and the corresponding VIEW. The version table is common for all versioned data tables. That’s all there is to it! This means that versioned join tables can be used without any extra effort.

Some extra routines are needed for the basic CRUD operations (Create, Read, Update and Delete). Inserts involve two tables, updates are actually inserts, and deletes are actually updates. Some of the CRUD stuff can be handled by triggers but it’s probably easier to handle it in the database abstraction of the application. 

When adding a ”people” record, first add a version record with a generated guid string, then create the people record and set people.id to version.id.

Update the people record identified by guid in three steps:

  1. Update the existing version record by setting timestamp_end and changedby_end
  2. Add a new version record with the same guid. Capture the new version.id
  3. Add a new complete people record with the updated information and set id to version.id.

To ”delete” a people record, just set timestamp_end of the current version record for that guid. The people record will disappear from the people view.

To show a people record in the state it was at a specific point in time, just do like this:

SET @timeslice=”2010-02-22 10:00:00″;

SELECT * FROM people;

 

Testdata SQL

/* INSERT RECORD */

INSERT INTO version (guid, changedby_start) VALUES (”gweyufghkj”, ”JS”);

INSERT INTO people_data (id_version, firstname, lastname) VALUES(LAST_INSERT_ID(), ”Nisse”, ”Hult”);

 

/* VIEW RECORDS – most current version */

SELECT * FROM people;

/* 1 Nisse Hult 2010-02-26 23:47:57 gweyufghkj */

 

/* UPDATE RECORD */

/* First make the previous version record obsolete */

UPDATE version SET changedby_end=”JS”, dt_end=NOW() WHERE guid=”gweyufghkj” AND dt_end=”3000-01-01 00:00:00″;

/* Then insert a new version record */

INSERT INTO version (guid, changedby_start) VALUES (”gweyufghkj”, ”JS”);

/* Finally insert a new people record with the changed data */

INSERT INTO people_data (id_version, firstname, lastname) VALUES(LAST_INSERT_ID(), ”Nisse”, ”Bengtsson”);

 

/* VIEW RECORDS – most current version */

SELECT * FROM people;

/* 2 Nisse Bengtsson 2010-02-26 23:48:11 gweyufghkj */

 

/* VIEW RECORDS – the version at a specific time, consistent across all versioned tables */

SET @history=”2010-02-26 23:48:00″;

SELECT * FROM people;

/* 1 Nisse Hult 2010-02-26 23:47:57 gweyufghkj */

 

/* DELETE RECORD – all record history will be maintained */

UPDATE version SET changedby_end=”JS”, dt_end=NOW() WHERE guid=”gweyufghkj” AND dt_end=”3000-01-01 00:00:00″;

 

/* SEE THE DELETED RECORD AGAIN – at a time when it still existed */

SET @history=”2010-02-26 23:48:15″;

SELECT * FROM people;

/* 2 Nisse Bengtsson 2010-02-26 23:48:11 gweyufghkj */

Update: corrected variable name @timeslice to @history

Update:

Online demo here http://johan.solve.se/dataversioning/

Download demo files for Lasso 8 (requires Knop) http://johan.solve.se/dataversioning/demo.zip

Presentation:

 

9 tankar kring ”Data Versioning

  1. Johan Sölve

    I’ve made some tests with the people table, a customer table and a join table (all tables are versioned, even the join table) to see how they behave with some real world volumes. I had 3000 rows in the people table, 1137 rows in the customer table and 3000 rows in the join table.Here’s a query that shows all people with their company.SELECT c.name, c.uid uid_customer, p.firstname, p.lastname, p.uid uid_people
    FROM customer c
    LEFT JOIN people_customer_map pm ON c.uid=pm.uid_customer
    LEFT JOIN people p ON p.uid=pm.uid_people;The query runs in about 40 ms locally on my MacBook and returns 3000 rows.This is the EXPLAIN for that query.1
    SIMPLE
    data_customer
    ALL
    id_revision
    NULL
    NULL
    NULL
    1137
    1
    SIMPLE
    revision
    eq_ref
    PRIMARY
    PRIMARY
    8
    recordversioning.data_customer.id_revision
    1
    Using where1
    SIMPLE
    data_people_customer_map
    ref
    uid_customer
    uid_customer
    23
    recordversioning.revision.uid
    10
    1
    SIMPLE
    revision
    eq_ref
    PRIMARY
    PRIMARY
    8
    recordversioning.data_people_customer_map.id_revision
    1
    1
    SIMPLE
    revision
    ref
    PRIMARY,uid
    uid
    23
    recordversioning.data_people_customer_map.uid_people
    10
    1
    SIMPLE
    data_people
    ref
    id_revision
    id_revision
    9
    recordversioning.revision.id
    11
    Using whereIt looks like this time machine performs almost as good as a regular unversioned solution would do.

    Svara
  2. Johan Sölve

    Eric, Just set a variable in the same mysql session/connection:SET @timeslice="2010-02-22 10:00:00";

    Svara
  3. Johan Sölve

    Sorry, that variable name was not right. It should be @history. SET @history="2010-02-22 10:00:00";

    Svara

Lämna ett svar

Din e-postadress kommer inte publiceras. Obligatoriska fält är märkta *