What are the best practices for compress_segmentby?

Hi all! The compress_segmentby setting works somewhat like an index. What are the recommendations for choosing the segment-by columns?
For example, a table has many columns, among them col1 and col2. col1 has only two values: 0 and 1. col2 has more than 50,000 distinct values but is a great candidate for an index.
And what would the optimal order of these fields be: compress_segmentby => 'col1, col2' or compress_segmentby => 'col2, col1'?
Is there perhaps a limit on the number of distinct values for a compress_segmentby column?
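(To make the question concrete, the two variants I am comparing would look roughly like this; my_table stands in for my real hypertable.)

    -- Variant A: low-cardinality column first
    ALTER TABLE my_table SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'col1, col2'
    );

    -- Variant B: high-cardinality column first
    -- ALTER TABLE my_table SET (
    --     timescaledb.compress,
    --     timescaledb.compress_segmentby = 'col2, col1'
    -- );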

Thanks in advance

Hi Lux, I guess it will also depend on the percentage of 0 values versus 1 values in col1. I remember @yannis sharing how inefficient indexing on large-cardinality columns can be.

If you have the opportunity to benchmark both, share the results with us.

@yannis any extra thoughts about it?

Hi, @jonatasdp!
For col1 = 0 the row count is 326,000,000, and for col1 = 1 it is 234,000,000.
The query SELECT COUNT(*) FROM <hypertable> WHERE col1 = 1; ran for 26 s (which is fast enough for me).

Hi @LuxCore,

Segment-by columns are really important both for achieving higher compression rates and for speeding up queries on compressed chunks.

Compression is a balancing act, and you'll get the most out of it if you can keep the compressed batches as full as possible. As you can read in the link I shared above, we cluster the data based on the segment-by values and then compress all the tuples in each batch together. You want to get as close to 1000 tuples per batch as possible for optimal performance.

You want to select segment-by columns that have two main characteristics:

  • They group your data in a natural way → better compression, locality, and query performance when filtering on the segment-by column(s)
  • Each unique combination of values results in ~1000 tuples or more in your uncompressed chunk → better compression and locality (a quick way to check this is sketched after this list)
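A minimal way to sanity-check that second point (the names are assumptions: a hypertable my_table with a time column, col2 as the candidate segment-by column, and a one-week chunk range):

    -- Rows per candidate segment-by value within a single chunk's time range.
    -- If most groups come out well below ~1000 rows, the candidate column is
    -- probably too fine-grained on its own for this chunk interval.
    SELECT col2, count(*) AS rows_per_segment
    FROM my_table
    WHERE time >= '2024-01-01' AND time < '2024-01-08'
    GROUP BY col2
    ORDER BY rows_per_segment ASC
    LIMIT 20;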

From that perspective, you don't want columns whose values map to very many records each (e.g. col1 in your example), as they do not really offer much on their own. On the other side, you also don't want to pick a column with very high cardinality, as you may end up with too few records per segment. The counter-example here would be picking a unique column for the segment-by, which would result in one record per compressed batch; you would compress each tuple on its own, achieving zero gain from compression (a negative gain in reality, since we'd have to add compression metadata on top).
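If you benchmark both layouts, the built-in compression statistics make the comparison easy; a sketch, assuming the hypertable is called my_table:

    -- Overall before/after sizes for the whole hypertable
    SELECT * FROM hypertable_compression_stats('my_table');

    -- Per-chunk breakdown, useful for spotting chunks that compress poorly
    SELECT chunk_name,
           before_compression_total_bytes,
           after_compression_total_bytes
    FROM chunk_compression_stats('my_table')
    ORDER BY chunk_name;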


Hi, @yannis!

I have a great candidate field for segmentby (following your explanations above), but 50% of the rows contain NULL in this field. How can I split the table/chunks to support segmenting on the non-null values?

Thanks in advance

Maybe someone knows what can be done to segment by a field for which 50% of the values are NULL?