Using EDB Advanced Storage Pack

Suggest edits

The following are scenarios where the EDB Advanced Storage Pack TAMs are useful.

Refdata example

A scenario where Refdata is useful is creating a reference table of all the New York Stock Exchange (NYSE) stock symbols and their corporate names. This data is expected to change very rarely and be referenced frequently from a table tracking all stock trades for the entire market.

Consider the following two tables:

CREATE TABLE nyse_symbol (
    nyse_symbol_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    symbol TEXT NOT NULL,
    name TEXT NOT NULL
);

CREATE TABLE nyse_trade (
    nyse_symbol_id INTEGER NOT NULL REFERENCES nyse_symbol(nyse_symbol_id),
    trade_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    trade_price FLOAT8 NOT NULL CHECK(trade_price >= 0.0),
    trade_volume BIGINT NOT NULL CHECK(trade_volume >= 1)
);

CREATE INDEX ON nyse_trade USING BTREE(nyse_symbol_id);

When heap is used for nyse_symbol, manipulating rows in nyse_trade causes row locks to be created in nyse_symbol. But only row locks are used in nyse_symbol:

=# BEGIN;
BEGIN
=*#
=*# INSERT INTO nyse_symbol (symbol, name)
-*#     VALUES ('A', 'A');
INSERT 0 1
=*#
=*# SELECT locktype, mode FROM pg_locks
-*#     WHERE relation = 'nyse_symbol'::regclass;
 locktype |       mode
----------+------------------
 relation | RowExclusiveLock
(1 row)
=*#
=*# COMMIT;
COMMIT
=#
=# BEGIN;
BEGIN
=*#    -- insert data into a table that has a foreign key to nyse_symbol
=*#    INSERT INTO nyse_trade (nyse_symbol_id, trade_price, trade_volume)
-*#        VALUES (1, 1, 1);
INSERT 0 1
=*#
=*#    -- display the row locks in nyse_symbol
=*#    SELECT * FROM pgrowlocks('nyse_symbol');
-[ RECORD 1 ]-----------------
locked_row | (0,1)
locker     | 778
multi      | f
xids       | {778}
modes      | {"For Key Share"}
pids       | {21480}
=*#

However, when refdata is used for nyse_symbol, the locking pattern changes. The table is created with the USING refdata clause:

CREATE TABLE nyse_symbol (
    nyse_symbol_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    symbol TEXT NOT NULL,
    name TEXT NOT NULL
) USING refdata;

In this case, manipulating data in nyse_trade doesn't generate row locks in nyse_symbol. But manipulating nyse_symbol directly causes an EXCLUSIVE lock to be acquired on the entire relation:

=# BEGIN;
BEGIN
=*#
=*# INSERT INTO nyse_symbol (symbol, name)
-*#     VALUES ('A', 'A');
INSERT 0 1
=*#
=*# SELECT locktype, mode FROM pg_locks
-*#     WHERE relation = 'nyse_symbol'::regclass;
 locktype |       mode
----------+------------------
 relation | RowExclusiveLock
 relation | ExclusiveLock
(2 rows)
=*#
=*# COMMIT;
COMMIT
=#
=# BEGIN;
BEGIN
=*#    -- insert data into a table that has a foreign key to nyse_symbol
=*#    INSERT INTO nyse_trade (nyse_symbol_id, trade_price, trade_volume)
-*#        VALUES (1, 1, 1);
INSERT 0 1
=*#
=*#    -- display the row locks in nyse_symbol
=*#    SELECT * FROM refdata.pgrowlocks('nyse_symbol');
(0 rows)
=*#

Autocluster example

A scenario where Autocluster is useful is with Internet of Things (IoT) data, which are usually inserted with many rows that relate to each other and often use append-only data. When using heap instead of Autocluster, Postgres can't cluster together these related rows, so access to the set of rows touches many data blocks, can be very slow, and is input/output heavy.

This example is for an IoT thermostat that reports house temperatures and temperature settings every 60 seconds:

CREATE TABLE iot (
    thermostat_id         BIGINT NOT NULL,
    recordtime            TIMESTAMPTZ NOT NULL,
    measured_temperature  FLOAT4,
    temperature_setting   FLOAT4
) USING autocluster;

Using Autocluster, rows with the same thermostat_id are clustered together and are easier to access:

CREATE INDEX ON iot USING btree(thermostat_id);
SELECT autocluster.autocluster(
    rel := 'iot'::regclass,
    cols := '{1}',
    max_objects := 10000
);
Note

The cols parameter specifies the table that's clustered. In this case, {1} corresponds to the first column of the table, thermostat_id, which is the most common access pattern.

Populate the table with the thermostat_id and recordtime data:

INSERT INTO iot (thermostat_id, recordtime) VALUES (456, 12:01);
INSERT INTO iot (thermostat_id, recordtime) VALUES (8945, 04:55);
INSERT INTO iot (thermostat_id, recordtime) VALUES (456, 15:32);
INSERT INTO iot (thermostat_id, recordtime) VALUES (6785, 01:36);
INSERT INTO iot (thermostat_id, recordtime) VALUES (456, 19:25);
INSERT INTO iot (thermostat_id, recordtime) VALUES (5678, 03:44);

When you select the data from the IoT table, you can see from the ctid location that the data with the same thermostat_id was clustered together:

SELECT ctid, thermostat_id, recordtime FROM iot ORDER BY CTID;
Output
ctid   | thermostat_id | recordtime
-------+---------------+------------
 (0,1) |           456 |      12:01
 (0,2) |           456 |      15:32
 (0,3) |           456 |      19:25
 (2,2) |          8945 |      04:55
 (2,5) |          5678 |      03:44
 (3,2) |          6785 |      01:36
(6 rows)

Advanced example

This is an advanced example where Refdata and Autocluster are used together. It involves referencing the NYSE table from the Refdata example and clustering together the rows in the trade table based on the stock symbol. This approach makes it easier to find the latest number of trades.

Start with the NYSE table from the Refdata example:

CREATE TABLE nyse_symbol (
    nyse_symbol_id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    symbol TEXT NOT NULL,
    name TEXT NOT NULL
) USING refdata;

Create a highly updated table containing NYSE trades, referencing the mostly static stock symbols in the Refdata table. Cluster the rows on the stock symbol to make it easier to look up the last x trades for a given stock:

CREATE TABLE nyse_trade (
    nyse_symbol_id INTEGER NOT NULL REFERENCES nyse_symbol(nyse_symbol_id),
    trade_time TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    trade_price FLOAT8 NOT NULL CHECK(trade_price >= 0.0),
    trade_volume BIGINT NOT NULL CHECK(trade_volume >= 1)
) USING autocluster;

CREATE INDEX ON nyse_trade USING BTREE(nyse_symbol_id);
SELECT autocluster.autocluster(
    rel := 'nyse_trade'::regclass,
    cols := '{1}',
    max_objects := 3000
);
 autocluster
-------------

(1 row)

Prepopulate the static data (shortened for brevity):

INSERT INTO nyse_symbol (symbol, name) VALUES
    ('A', 'Agilent Technologies'),
    ('AA', 'Alcoa Corp'),
    ('AAC', 'Ares Acquisition Corp Cl A'),
    ('AAIC', 'Arlington Asset Investment Corp'),
    ('AAIN', 'Arlington Asset Investment Corp 6.000%'),
    ('AAN', 'Aarons Holdings Company'),
    ('AAP', 'Advance Auto Parts Inc'),
    ('AAQC', 'Accelerate Acquisition Corp Cl A'),
    ('ZTR', 'Virtus Total Return Fund Inc'),
    ('ZTS', 'Zoetis Inc Cl A'),
    ('ZUO', 'Zuora Inc'),
    ('ZVIA', 'Zevia Pbc Cl A'),
    ('ZWS', 'Zurn Elkay Water Solutions Corp'),
    ('ZYME', 'Zymeworks Inc');
ANALYZE nyse_symbol;

Insert artificial stock trades, one trade per stock symbol, repeating the pattern multiple times:

INSERT INTO nyse_trade
    SELECT nyse_symbol_id, now(), i, i
    FROM nyse_symbol, generate_series(1,1000000) AS i;
ANALYZE nyse_trade;

Given that the inserts intercalated nyse_symbol_id, a query that consults one stock touches most pages if the table uses heap, but touches far fewer pages using Autocluster.

The following query operates on attributes that must be fetched from the table after an index scan and shows the number of buffers touched:

EXPLAIN (ANALYZE, BUFFERS, TIMING OFF, SUMMARY OFF, COSTS OFF)
    SELECT AVG(trade_volume * trade_price)
    FROM nyse_trade WHERE nyse_symbol_id = 10;

This is the query plan using Autocluster:

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Aggregate (actual rows=1 loops=1)
 **Buffers: shared read=59609**
   ->  Bitmap Heap Scan on nyse_trade (actual rows=1000000 loops=1)
         Recheck Cond: (nyse_symbol_id = 10)
         Heap Blocks: exact=58824
         Buffers: shared read=59609
         ->  Bitmap Index Scan on nyse_trade_nyse_symbol_id_idx (actual rows=1000000 loops=1)
               Index Cond: (nyse_symbol_id = 10)
               Buffers: shared read=785
(9 rows)

For contrast, this is the query plan using heap:

                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Aggregate (actual rows=1 loops=1)
 **Buffers: shared read=103727**
   ->  Bitmap Heap Scan on nyse_trade (actual rows=1000000 loops=1)
         Recheck Cond: (nyse_symbol_id = 10)
         Rows Removed by Index Recheck: 8325053
         Heap Blocks: exact=37020 lossy=65922
         Buffers: shared read=103727
         ->  Bitmap Index Scan on nyse_trade_nyse_symbol_id_idx (actual rows=1000000 loops=1)
               Index Cond: (nyse_symbol_id = 10)
               Buffers: shared read=785
(10 rows)

Could this page be better? Report a problem or suggest an addition!