I have large fact tables with composite non-clustered indexes, something like this:
NONCLUSTERED INDEX (OrderDate,OrderType,ClientKey,ItemKey,CustomerKey)
The columns ending with "Key" are varchar(50) columns, but they store strings of 20-30 characters.
Do I need to worry in the long term that the varchar columns in the index actually vary in length? Will I get more page splits and slower performance when millions of rows are merged into these tables? https://stackoverflow.com/questions/59667/what-are-the-use-cases-for-selecting-char-over-varchar-in-sql
Does the SQL Server engine have a harder time figuring out the plan using that index, since the columns vary in length? I read that CHAR is better since it's fixed-size and easier for the DB engine to work with, and hence performs better. Is it that significant?
Is it worth hashing these "Key" columns individually with, let's say, the SHA2_256 function, so that they become fixed-size BINARY(32)?
NONCLUSTERED INDEX (OrderDate,OrderType,HASHEDClientKey,HASHEDItemKey,HASHEDCustomerKey)
I did my own test to answer at least the performance question, and didn't see any performance benefit with hash keys over varchar. In fact, it's a big overhead to calculate the hashes and store the extra 32 bytes of each next to the original "Key" column.
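For reference, the hashed-key setup I tested can be sketched like this (the table and column names here are assumptions; HASHBYTES with 'SHA2_256' returns a 32-byte varbinary on SQL Server 2012 and later, and is deterministic, so it can be persisted as a computed column):

```sql
-- Sketch of the hashed-key approach (dbo.FactOrders and column names are assumptions).
ALTER TABLE dbo.FactOrders ADD
    HashedClientKey   AS CAST(HASHBYTES('SHA2_256', ClientKey)   AS BINARY(32)) PERSISTED,
    HashedItemKey     AS CAST(HASHBYTES('SHA2_256', ItemKey)     AS BINARY(32)) PERSISTED,
    HashedCustomerKey AS CAST(HASHBYTES('SHA2_256', CustomerKey) AS BINARY(32)) PERSISTED;

-- The index then covers the fixed-size 32-byte hashes instead of the varchar keys.
CREATE NONCLUSTERED INDEX IX_FactOrders_HashedKeys
    ON dbo.FactOrders (OrderDate, OrderType, HashedClientKey, HashedItemKey, HashedCustomerKey);
```

Note that each persisted hash adds 32 bytes per row on top of the original varchar value, which is the overhead described above.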
What I cannot test is how these varchar indexes will behave when I have millions of rows merging into the tables each day.
Which one would you choose from both a performance and a maintenance point of view?
While some RDBMSs might be tuned to interact better with fixed-length strings (I seem to recall that maybe DB2/MVS did better with fixed-length strings, but I learned this in 1996 and it was mainframe DB2, so I am not sure whether that even applied to DB2 on Unix or Windows), I would be shocked if using CHAR(50) did anything but decrease performance as compared to VARCHAR(50), given that most values only use 20-30 bytes. Using CHAR(50) takes up more space, which reduces the number of rows that fit on a data page, which increases the number of data pages, which takes up more space in the buffer pool (i.e. RAM) when querying the data, plus takes longer to back up and restore. AND, if I am not mistaken, it will also increase the memory grants (i.e. RAM requested per query), as it will request 50 bytes per field instead of 25 (I believe it requests 50% of the max size of variable-length fields). For a more thorough look at the downstream effects of using more space than you need, please see: Disk Is Cheap! ORLY?.
That being said, between CHAR(50) and VARCHAR(50), I would choose: neither! Given the names of the columns, I am guessing that these are actually dimensions? Filtering on binary/numeric values is much faster than filtering on strings (there is one exception that I will mention in a moment). Depending on how many potential values there will be over time for clients, customers, and items, I would use either INT or, if truly needed, BIGINT as the keys for the Client, Customer, and Item dimensions, and store those numeric key values here.
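A minimal sketch of that surrogate-key approach, using hypothetical table and column names:

```sql
-- Hypothetical dimension table with a 4-byte integer surrogate key.
CREATE TABLE dbo.DimClient
(
    ClientId  INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    ClientKey VARCHAR(50)        NOT NULL UNIQUE  -- original business key, stored once
);

-- The fact table stores ClientId (4 bytes) instead of the 20-30 byte string,
-- so the composite index shrinks accordingly (ItemId/CustomerId analogous):
CREATE NONCLUSTERED INDEX IX_FactOrders_Keys
    ON dbo.FactOrders (OrderDate, OrderType, ClientId, ItemId, CustomerId);
```

The string values then live only in the dimension tables, and the fact table joins back to them by the narrow integer key.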
At bare minimum, I would change the collation on the VARCHAR columns to Latin1_General_100_BIN2, since you clearly don't need case or accent insensitivity, else the idea of using a SHA-256 hash would not even be an option. That should at least be able to filter almost as fast as an INT or BIGINT (I say "almost" only due to, again, the effect of storing 20-30 byte values for 3 columns instead of 4 or 8 byte values for those same 3 columns, when accounting for millions/billions of rows).
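If you go that route, the change could look like this (a sketch with assumed names; note that a column cannot have its collation altered while it is part of an index, so the index has to be dropped first and recreated afterwards):

```sql
-- Assumed table/index/column names.
DROP INDEX IX_FactOrders_Keys ON dbo.FactOrders;

-- Binary collation: comparisons become straight byte comparisons.
ALTER TABLE dbo.FactOrders
    ALTER COLUMN ClientKey VARCHAR(50) COLLATE Latin1_General_100_BIN2 NOT NULL;
-- (repeat for ItemKey and CustomerKey)

CREATE NONCLUSTERED INDEX IX_FactOrders_Keys
    ON dbo.FactOrders (OrderDate, OrderType, ClientKey, ItemKey, CustomerKey);
```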
I would strongly advise against using any type of hash value here. Not only would you be bloating the table (i.e. decreasing performance), but you would then lose any meaning in those values, and would hence either need to also store the non-hashed values for those 3 columns in the fact table (i.e. crazy) or require a lookup back to the dimension table (in which case you gained nothing over using an INT or BIGINT at 4 or 8 bytes, plus no calculation overhead!).