Enabling data purge

This section provides instructions on how to automate data purging when using PostgreSQL. Specifically, on the table with high growing tuple count (e.g. the data_sample_raw table).

The same pattern and technique can be translated to another database dialect, but the code is Postgres specific. Also the scripts mentioned in this step are available in the Sensys repository contrib/database directory.

Background on Sensys Data Purging Requirement

The Sensys process is often configured to continuously collect and store data to the database. Depending on the configuration, these data can grow rapidly and unbounded. Hence, there need to be a way to purge out the old data.

Please note, data archive, if interested, should be done before purging data. Once the data is purge, it is gone from the database. The Sensys PostgreSQL schema (orcmdb_psql.sql) provides two ways to pure out data (based on timestamp attribute):

  • A stored procedure to delete data after certain time interval from current time
  • A trigger based partition with dropping partitions that has exceeded certain time interval from the timestamp of the latest tuple

Each of these will be shown in detail on how to deploy in the following sections.

Purging with 'purge_data()' Function

The purge_data() function provides the simple way to deleting data after the specified time interval from current time. At the high level, this function resolves an SQL DELETE statement:

DELETE FROM table_name WHERE timestamp_column < CURRENT_TIMESTAMP - interval '<interval_number> <interval_unit>';

Below is an example of the SQL statement invocation of the purge_data() function. This statement would DELETE all rows in the data_sample_raw table WHERE time_stamp column having value of less than 6 months from the current time.

SELECT purge_data('data_sample_raw', 'time_stamp', 6, 'MONTH');

The SQL statement example above can be schedule to run on a regular basis using pgAgent. Alternative, the statement can also be executed as a shell script using psql tool. It then can easily be scheduled to run on a regular basis using any schedule tool such as crontab.

$ psql -U username -d database -c "SELECT purge_data('data_sample_raw', 'time_stamp', 6, 'MONTH');"

Purging with Dropping Partition

Another way to purge data is to DROP the table rather than DELETE tuple from a table as in how the purge_data() function works. We do not want to DROP the whole table, but only a section of the table. There is no such thing as DROP a section of a table; However, DROP of a table partition is possible. So for this to work, we will first enable partition on the table and DROP any out dated partition.

Below is the detail instruction on how to enable partition on a given table with and without the helper script.

Partition without Helper (psql)

The steps in this section involve invoking PostgreSQL functions. If you are not comfortable with SQL, see the next section on how to perform this task with the helper script.

Note, running the generate_partition_triggers_ddl() will only output the DDL code to enable partition. It will not make any change to the database. Also a few things to be aware when calling this function:

  • Within a GUI application such as pgAdminIII, the result may be truncated, so be sure to set it to allow longer length of characters in the return result.
  • From the command line such as psql the result is displayed as

Below is an example of the SQL statement invocation on the generated partition code for the data_sample_raw table on the time_stamp column by DAY and keep the last 180 partitions.

  1. Generate the partition DDL (data definition language) code.

    SELECT generate_partition_triggers_ddl('data_sample_raw', 'time_stamp', 'DAY', 180)';
    
  2. Review the generated code from the output above. Note, some GUI tool may be set to limit the length of the string in the result. Check the GUI setting or run from command line interface (psql) to get the complete generated code.

  3. Turn on auto dropping old partition by uncomment the -- EXECUTE('DROP TABLE... statement within the generated code. By default, the generated DROP TABLE statement is commented out.

  4. Execute the modified and reviewed version of the generated code

Partition with Helper Script

The same process can be done using pg_partition_helper.py, a python helper script that would perform the step above. This pythons script uses SQLAlchemy to connect to the PostgreSQL database and execute the SQL DML (data manipulation language) and DDL. For this to work, the system needs to have the following python modules installed:

  • SQLAlchemy - Database Abstraction Library
% pip install --upgrade SQLAlchemy`
  • psycopg2 - Python-PostgreSQL, also the default PostgreSQL driver used by SQLAlchemy
% pip install psycopg2

Because this helper script uses SQLAlchemy to connect to the database, the database connection string will need to be in the format of SQLAlchemy database URL syntax

The helper script expects the PostgreSQL database URL to be stored in the environment variable named PG_DB_URL or passed to it with the --db-url option, see example below on how to set it.

$ export PG_DB_URL=postgresql://[username[:password]]@host[:port]/database

The helper script has two commands: enable and disable partition. Use -h or --help to get usage syntax. Below is an example of enabling trigger based partition for the data_sample_raw table on the time_stamp column by DAY and keeping the last 10 partitions (based on the timestamp of the new tuple).

$ export PG_DB_URL=postgresql://[username[:password]]@host[:port]/database
$ python pg_partition_helper.py enable data_sample_raw time_stamp DAY --interval-to-keep 10

Example of disable partition for the data_sample_raw table.

$ export  PG_DB_URL=postgresql://[username[:password]]@host[:port]/database
python pg_partition_helper.py disable data_sample_raw

Managing Partitions

Table partitioning in PostgreSQL database is based on table inheritance. Each new partition of a main table is a whole new table that inherits the schema of the main table. In the examples above, data_sample_raw is the main table that does not hold any record. All records are stored in its corresponding partition. When querying data from the main table, data_sample_raw, PostgreSQL automatic queries the data from the partitioned tables. This can be verified using PostgreSQL EXPLAIN statement.

Partition can be un-linked from the main table (no longer a part of the main table), this effectively provides an archive mechanism, or it can be DROPPED from the schema. Below is an example of breaking the inheritance for the data_sample_raw_20150701_time_stamp partition from the main table data_sample_raw.

ALTER TABLE data_sample_raw_20150701_time_stamp NO INHERIT data_sample_raw;

Recommendation

The simple purge_data() function ultimately is an SQL DELETE statement. The process of deleting records usually involves some logging as the overhead. In contrast to deleting records, dropping the whole partition of a table is close to instantaneous (i.e. quick format). However, this requires partitioning a table in order to have partition to drop. Partition of a table in turn incurs a cost of executing a trigger to route each new tuple to the respective partition, but it provides many other benefits beside able to drop the whole partition.

The recommendation is to use the purge_data() approach as this may already be sufficient for the data purging. The data partitioning approach is better for archive and potentially improve performance on accessing the data since the data is 'partitioned' to different tables.