Enabling data purge
This section provides instructions on how to automate data purging when using PostgreSQL, specifically on tables with a rapidly growing tuple count (e.g. the data_sample_raw table).
The same pattern and technique can be translated to other database dialects, but the code here is Postgres specific. The scripts mentioned in this section are available in the contrib/database directory of the Sensys repository.
Background on Sensys Data Purging Requirement
The Sensys process is often configured to continuously collect and store data in the database. Depending on the configuration, this data can grow rapidly and without bound, so there needs to be a way to purge old data.
Please note that if data archiving is desired, it should be done before purging; once the data is purged, it is gone from the database. The Sensys PostgreSQL schema (orcmdb_psql.sql) provides two ways to purge data (based on a timestamp attribute):
- A stored procedure that deletes data older than a certain time interval from the current time
- A trigger-based partitioning scheme in which partitions are dropped once they exceed a certain time interval, measured from the timestamp of the latest tuple
Each of these is shown in detail, including how to deploy it, in the following sections.
Purging with 'purge_data()' Function
The purge_data() function provides a simple way to delete data older than a specified time interval from the current time. At a high level, this function resolves to a SQL DELETE statement:
DELETE FROM table_name WHERE timestamp_column < CURRENT_TIMESTAMP - interval '<interval_number> <interval_unit>';
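In essence, the function interpolates its arguments into that template. The following Python sketch (a hypothetical helper for illustration only, not the actual PL/pgSQL procedure from orcmdb_psql.sql) shows how the arguments map onto the statement:

```python
def build_purge_sql(table, ts_column, interval_number, interval_unit):
    """Build the DELETE statement that purge_data() effectively runs.

    Illustrative sketch only: identifiers are assumed to be trusted
    (no quoting or escaping is performed here).
    """
    return (
        f"DELETE FROM {table} "
        f"WHERE {ts_column} < CURRENT_TIMESTAMP "
        f"- interval '{interval_number} {interval_unit}';"
    )

print(build_purge_sql("data_sample_raw", "time_stamp", 6, "MONTH"))
```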
Below is an example invocation of the purge_data() function. This statement deletes all rows in the data_sample_raw table whose time_stamp value is older than 6 months from the current time.
SELECT purge_data('data_sample_raw', 'time_stamp', 6, 'MONTH');
The SQL statement above can be scheduled to run on a regular basis using pgAgent. Alternatively, the statement can be executed as a shell script using the psql tool, which can then be scheduled with any scheduling tool such as crontab.
$ psql -U username -d database -c "SELECT purge_data('data_sample_raw', 'time_stamp', 6, 'MONTH');"
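If you prefer to drive the purge from a script rather than a raw shell one-liner, the psql invocation can be composed programmatically. The sketch below (a hypothetical wrapper; the user and database names are placeholders) builds the same command shown above:

```python
import shlex

def psql_purge_command(user, database, table, ts_column, number, unit):
    """Compose the psql command line that runs purge_data().

    Hypothetical helper for illustration; adjust the flags
    (e.g. host, port, password handling) to your setup.
    """
    sql = f"SELECT purge_data('{table}', '{ts_column}', {number}, '{unit}');"
    return ["psql", "-U", user, "-d", database, "-c", sql]

cmd = psql_purge_command("username", "database",
                         "data_sample_raw", "time_stamp", 6, "MONTH")
# Print a shell-safe version of the command, e.g. for a crontab entry.
print(shlex.join(cmd))
```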
Purging with Dropping Partition
Another way to purge data is to DROP a table rather than DELETE tuples from it, as the purge_data() function does. We do not want to drop the whole table, only a section of it. There is no such thing as dropping a section of a table; however, dropping a table partition is possible. So for this to work, we first enable partitioning on the table and then DROP any outdated partitions.
Below are detailed instructions on how to enable partitioning on a given table, with and without the helper script.
Partition without Helper (psql)
The steps in this section involve invoking PostgreSQL functions. If you are not comfortable with SQL, see the next section on how to perform this task with the helper script.
Note, running generate_partition_triggers_ddl() only outputs the DDL code to enable partitioning; it does not make any change to the database. A few things to be aware of when calling this function:
- Within a GUI application such as pgAdminIII, the result may be truncated, so be sure to set it to allow a longer character length in the returned result.
- From a command-line interface such as psql, the complete result is displayed.
Below is an example of generating the partition code for the data_sample_raw table, partitioned on the time_stamp column by DAY and keeping the last 180 partitions.
- Generate the partition DDL (data definition language) code.
SELECT generate_partition_triggers_ddl('data_sample_raw', 'time_stamp', 'DAY', 180);
- Review the generated code from the output above. Note, some GUI tools may limit the length of strings in the result. Check the GUI settings or run from a command-line interface (psql) to get the complete generated code.
- Turn on automatic dropping of old partitions by uncommenting the -- EXECUTE('DROP TABLE... statement within the generated code. By default, the generated DROP TABLE statement is commented out.
- Execute the modified and reviewed version of the generated code.
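The uncommenting step can also be scripted. The sketch below (an illustrative helper, not part of the Sensys schema; it assumes the generated code comments the drop out as a line beginning with "-- " containing DROP TABLE, which may differ between schema versions) strips that comment prefix:

```python
def uncomment_drop_statements(ddl: str) -> str:
    """Remove the leading '-- ' from commented-out DROP TABLE lines
    in DDL emitted by generate_partition_triggers_ddl().

    Illustrative sketch: assumes the drop statement is commented out
    as a line starting with '-- ' that contains 'DROP TABLE'.
    """
    out = []
    for line in ddl.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("-- ") and "DROP TABLE" in stripped:
            indent = line[: len(line) - len(stripped)]
            out.append(indent + stripped[3:])  # drop the '-- ' prefix
        else:
            out.append(line)
    return "\n".join(out)
```

Always review the resulting DDL before executing it, since the uncommented statement will irreversibly drop old partitions.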
Partition with Helper Script
The same process can be done using pg_partition_helper.py, a Python helper script that performs the steps above. This Python script uses SQLAlchemy to connect to the PostgreSQL database and execute the SQL DML (data manipulation language) and DDL. For this to work, the system needs to have the following Python modules installed:
- SQLAlchemy - Database Abstraction Library
% pip install --upgrade SQLAlchemy
- psycopg2 - Python-PostgreSQL, also the default PostgreSQL driver used by SQLAlchemy
% pip install psycopg2
Because this helper script uses SQLAlchemy to connect to the database, the database connection string needs to be in SQLAlchemy database URL syntax.
The helper script expects the PostgreSQL database URL to be stored in the environment variable named PG_DB_URL or passed with the --db-url option; see the example below on how to set it.
$ export PG_DB_URL=postgresql://[username[:password]]@host[:port]/database
The helper script has two commands: enable and disable partitioning. Use -h or --help to get the usage syntax. Below is an example of enabling trigger-based partitioning for the data_sample_raw table on the time_stamp column by DAY, keeping the last 10 partitions (based on the timestamp of the newest tuple).
$ export PG_DB_URL=postgresql://[username[:password]]@host[:port]/database
$ python pg_partition_helper.py enable data_sample_raw time_stamp DAY --interval-to-keep 10
Example of disabling partitioning for the data_sample_raw table.
$ export PG_DB_URL=postgresql://[username[:password]]@host[:port]/database
$ python pg_partition_helper.py disable data_sample_raw
Managing Partitions
Table partitioning in PostgreSQL is based on table inheritance. Each new partition of a main table is a whole new table that inherits the schema of the main table. In the examples above, data_sample_raw is the main table and does not hold any records; all records are stored in its corresponding partitions. When querying data from the main table, data_sample_raw, PostgreSQL automatically queries the data from the partitioned tables. This can be verified using the PostgreSQL EXPLAIN statement.
A partition can be unlinked from the main table (so it is no longer part of it), which effectively provides an archiving mechanism, or it can be DROPPED from the schema. Below is an example of breaking the inheritance of the data_sample_raw_20150701_time_stamp partition from the main table data_sample_raw.
ALTER TABLE data_sample_raw_20150701_time_stamp NO INHERIT data_sample_raw;
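The partition name above encodes the main table, the partition's date, and the timestamp column. The sketch below (a hypothetical helper; the naming pattern is inferred from the example partition name and may differ in your schema version) generates the detach statement for a given day:

```python
from datetime import date

def partition_name(main_table: str, day: date, ts_column: str) -> str:
    """Build a daily partition name following the pattern seen above:
    <main_table>_<YYYYMMDD>_<timestamp_column>.

    The pattern is an assumption inferred from the example and may
    differ between schema versions.
    """
    return f"{main_table}_{day:%Y%m%d}_{ts_column}"

def detach_sql(main_table: str, day: date, ts_column: str) -> str:
    """ALTER TABLE ... NO INHERIT statement that detaches (archives)
    one daily partition from its main table."""
    part = partition_name(main_table, day, ts_column)
    return f"ALTER TABLE {part} NO INHERIT {main_table};"

print(detach_sql("data_sample_raw", date(2015, 7, 1), "time_stamp"))
```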
Recommendation
The simple purge_data() function is ultimately a SQL DELETE statement, and deleting records carries logging overhead. In contrast, dropping a whole partition of a table is close to instantaneous (like a quick format). However, this requires the table to be partitioned so that there are partitions to drop. Partitioning a table in turn incurs the cost of executing a trigger to route each new tuple to its respective partition, but it provides many other benefits besides the ability to drop whole partitions.
The recommendation is to use the purge_data() approach, as this may already be sufficient for data purging. The partitioning approach is better for archiving and can potentially improve performance when accessing the data, since the data is 'partitioned' into different tables.