Josh-D. S. Davis

Deduplication
Deduplication is basically solid compression with a huge dictionary. If you send in pre-compressed data, your deduplication ratio will suffer, because A) the data is already compressed, and B) the compressor's substitution table changes even for minor file changes, so near-identical files no longer produce identical blocks.
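
To see point B in action, here's a rough Python sketch (fixed 4 KB chunks and zlib stand in for a real engine's variable-size chunking and compressor, so the numbers are only directional): a one-byte change near the front of a file leaves nearly every raw chunk matchable, but almost none of the compressed file's chunks survive.

import hashlib
import zlib

def chunk_hashes(data, chunk_size=4096):
    # Hash fixed-size chunks; production engines use variable-size chunking.
    return {hashlib.sha256(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)}

# Compressible, non-repeating sample data (stand-in for a backup stream).
original = " ".join(str(i) for i in range(200000)).encode()
modified = b"X" + original[1:]            # one-byte change near the start

raw_overlap = len(chunk_hashes(original) & chunk_hashes(modified))
zip_overlap = len(chunk_hashes(zlib.compress(original)) &
                  chunk_hashes(zlib.compress(modified)))

print("matching raw chunks:       ", raw_overlap)   # nearly all of them
print("matching compressed chunks:", zip_overlap)   # typically close to zero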

The exception is when compression happens as part of the deduplication processing. Some engines will chunk the data, identify each chunk as duplicate or new, and THEN compress the new data before storing it. This works pretty well, and in an incremental-forever environment you can expect 6.8-7.8:1 reduction in stored data.
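
A minimal sketch of that ordering, assuming a toy ChunkStore with fixed 128 KiB chunks, SHA-256 fingerprints, and zlib (none of this is any vendor's actual API):

import hashlib
import zlib

CHUNK_SIZE = 128 * 1024          # fixed chunks for simplicity

class ChunkStore:
    def __init__(self):
        self.chunks = {}         # sha256 digest -> compressed chunk
        self.logical = 0         # bytes sent in
        self.stored = 0          # bytes actually kept

    def ingest(self, stream):
        # Return the recipe (list of digests) needed to rebuild the stream.
        recipe = []
        for i in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            self.logical += len(chunk)
            if digest not in self.chunks:        # new data: compress and keep
                compressed = zlib.compress(chunk)
                self.chunks[digest] = compressed
                self.stored += len(compressed)
            recipe.append(digest)                # duplicate: reference only
        return recipe

    def ratio(self):
        return self.logical / max(self.stored, 1)

full = " ".join(str(i) for i in range(300000)).encode()   # stand-in full backup
store = ChunkStore()
store.ingest(full)        # day 1: all chunks are new
store.ingest(full)        # day 2: all chunks are duplicates
print(round(store.ratio(), 1))

Duplicates cost only a recipe entry, so re-ingesting an unchanged full barely grows the store while the logical bytes protected keep climbing.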

If you send fulls every day, then you can expect roughly a (days × 0.9):1 ratio. I.e., if you keep 30 days of fulls, you can expect somewhere around 25:1 reduction in stored data (assuming a 10% daily change rate).
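
That rule of thumb written out (the 0.9 factor is just the heuristic above, not a derived model):

def expected_ratio(days_of_fulls):
    # Daily fulls, roughly 10% daily change rate.
    return days_of_fulls * 0.9

print(expected_ratio(30))   # ~27, i.e. in the 25:1 neighborhood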

If you multiplex/interleave your data streams, you can expect maybe 2:1. This is because your duplicate data is paired with different blocks each time, so the chunks never line up between runs. It's best to use "FILESPERSET=1", "MULTIPLEX=1", or a similar setting so the streams are not interleaved.
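
Here's a contrived Python illustration (the 64 KiB multiplex block, the 128 KiB dedup chunk, and the two fake client streams are all arbitrary assumptions): two unchanged streams are multiplexed on consecutive nights, but a one-block timing skew pairs the blocks differently, so almost no chunks of the combined stream match between nights.

import hashlib

MUX_BLOCK = 64 * 1024        # size of each multiplexed block on "tape"
DEDUP_CHUNK = 128 * 1024     # chunk size used by the dedup engine

# Two client streams that do not change from night to night.
client_a = b"".join(b"fileA record %08d\n" % i for i in range(60000))
client_b = b"".join(b"fileB record %08d\n" % i for i in range(60000))

def blocks(data):
    return [data[i:i + MUX_BLOCK] for i in range(0, len(data), MUX_BLOCK)]

def interleave(a, b, skew):
    # Round-robin multiplex; `skew` models client A getting a head start.
    a_blocks, b_blocks = blocks(a), blocks(b)
    out = a_blocks[:skew]
    for x, y in zip(a_blocks[skew:], b_blocks):
        out += [x, y]
    return b"".join(out)

def chunk_hashes(stream):
    return {hashlib.sha256(stream[i:i + DEDUP_CHUNK]).digest()
            for i in range(0, len(stream), DEDUP_CHUNK)}

night1 = interleave(client_a, client_b, skew=0)
night2 = interleave(client_a, client_b, skew=1)   # same data, slower client B
shared = chunk_hashes(night1) & chunk_hashes(night2)
print(len(shared), "of", len(chunk_hashes(night1)), "chunks match")

Keep the streams separate (the FILESPERSET=1 case) and the same test matches every chunk night after night.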

Also, depending on the ingest processor, objects up to around 500 GB may deduplicate well, but larger objects may cause the engine to switch to a larger chunk size. This can be a problem for systems which perform image AND file-level backups of the same data, or archive logs plus databases, since the two copies end up chunked differently and no longer match.

For precompressed data, you can expect 1.3:1.

For encrypted data, you can expect 0.9-1.0:1, depending on the system and data size.

This all holds true whether the deduplication comes from Linux, CommVault, TSM, ProtecTier, FalconStor, or whatever. The only real differences between the various implementations at this point are the performance of the chunking and identification algorithms and the scalability of the chunk store.
