I have a database which stores a receiver column to indicate which account the data relates to. This has led to a lot of duplicated data: one set of data may create 3 separate rows in which every column is identical except for the receiver column. While redesigning the database, I have considered using an array with a GIN index instead of the current B-Tree index on receiver.
Current table definition:
CREATE TABLE public.actions (
    global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass),
    time timestamp with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    receiver text NOT NULL,
    tx_id text NOT NULL,
    block_num integer NOT NULL,
    contract text NOT NULL,
    action text NOT NULL,
    data jsonb NOT NULL
);
Indexes:
- "actions_pkey" PRIMARY KEY, btree (global_sequence, time)
- "actions_time_idx" btree (time DESC)
- "receiver_idx" btree (receiver)
Field details:
- Global sequence is a serially incrementing ID
- Block number and time are not unique, but they also increase over time
- Global sequence and time together form the primary key, as the data is internally partitioned by time
- There are some receivers that have over 1 billion associated actions (each with a unique global_sequence).
- Average text lengths:
  - receiver: 12
  - tx_id: 52
  - contract: 12
  - action: 6
  - data: small-medium sized JSONB with action metadata
Cardinality of 3 schema options:
- Current: sitting at 4.2 billion rows in this table
- Receiver as array: Would be at approximately 1.8 billion rows
- Normalized: There would be 3 tables:
  - Actions: 1.8 billion rows
  - Actions_Accounts: 4.2 billion rows
  - Accounts: 500 000 rows
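To make the two alternatives concrete, here is a rough sketch of what they might look like. All table, column, and index names below (actions_arr, receivers, actions_norm, accounts, actions_accounts, account_id) are hypothetical and only for illustration. Copying time into the junction table is an assumption on my part, made so that rows could be ordered by time without first joining to the actions table. Partitioning is left out for brevity.

```sql
-- Option 2 (sketch): one row per action, receivers stored as an array.
CREATE TABLE actions_arr (
    global_sequence bigint NOT NULL,
    time            timestamp with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    receivers       text[] NOT NULL,   -- hypothetical: replaces the single receiver column
    tx_id           text NOT NULL,
    block_num       integer NOT NULL,
    contract        text NOT NULL,
    action          text NOT NULL,
    data            jsonb NOT NULL,
    PRIMARY KEY (global_sequence, time)
);

-- GIN index so that "array contains this receiver" lookups can use an index.
CREATE INDEX actions_arr_receivers_gin ON actions_arr USING gin (receivers);

-- Option 3 (sketch): normalized into three tables.
CREATE TABLE accounts (
    account_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       text NOT NULL UNIQUE    -- the old receiver value
);

CREATE TABLE actions_norm (
    global_sequence bigint NOT NULL,
    time            timestamp with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
    tx_id           text NOT NULL,
    block_num       integer NOT NULL,
    contract        text NOT NULL,
    action          text NOT NULL,
    data            jsonb NOT NULL,
    PRIMARY KEY (global_sequence, time)
);

-- Junction table: one row per (account, action) pair.
-- time is duplicated here (assumption) so the primary key index
-- can be read in time order for a single account.
CREATE TABLE actions_accounts (
    account_id      integer NOT NULL REFERENCES accounts,
    time            timestamp with time zone NOT NULL,
    global_sequence bigint NOT NULL,
    PRIMARY KEY (account_id, time, global_sequence)
);
```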
Common Query:
SELECT * FROM actions WHERE receiver = 'Alpha' ORDER BY time DESC LIMIT 100
All columns are required in the query, and there are no NULL values. I believe the joins in the normalized schema would slow the query down, and query speed is the #1 priority.
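For comparison, this is roughly how I would expect the common query to look under the other two designs. This is only a sketch: it assumes a hypothetical table actions_arr with a text[] column named receivers for the array option, and hypothetical tables accounts(account_id, name), actions_accounts(account_id, time, global_sequence) and actions_norm (the actions table without receiver) for the normalized option. None of these names exist in the current schema.

```sql
-- Array option: match rows whose receivers array contains 'Alpha'.
-- The @> (contains) operator is what a GIN index on an array column supports.
SELECT *
FROM actions_arr
WHERE receivers @> ARRAY['Alpha']
ORDER BY time DESC
LIMIT 100;

-- Normalized option: resolve the account, take its newest junction rows,
-- then fetch the matching actions.
SELECT a.*
FROM accounts acc
JOIN actions_accounts aa
  ON aa.account_id = acc.account_id
JOIN actions_norm a
  ON a.global_sequence = aa.global_sequence
 AND a.time = aa.time
WHERE acc.name = 'Alpha'
ORDER BY aa.time DESC
LIMIT 100;
```

One thing I am unsure about with the array option: as far as I know, a GIN index does not return rows in any particular order, so the ORDER BY time DESC LIMIT 100 part may not be able to stop early for a receiver with over a billion matching actions.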