If You Do Dedupe on Primary Storage, Do You Have to Expand It to do Backups?

In his comment in the ongoing discussion of the CORE formula (Dedupe Rates Matter…Just Not as Much as You Think), Steve Kenniston had the following to say about the relationship between shrinking data on primary storage and backups:

Example, if I use Ocarina deduplication, but have already purchased Data Domain, don’t I need to re-hydrate the Ocarina deduplicated, primary storage data before I use Data Domain?  They say you do.  That means I don’t really save on my primary storage if I need the space to re-hydrate before I back it up and that also means processing time on the array.  Storwize, with random access compression doesn’t require decompression.


There are several interesting issues brought up here. They only relate to the CORE formula in that the formula does not account for them.

Today, if you have the most common dedupe for primary storage (NetApp dedupe) and the most common dedupe for backup (EMC’s Data Domain product), it works like this.


You start with a volume of, say, 16TB. NetApp dedupe will shrink that to maybe 8TB. Then you go to back up. The backup server sends an NDMP request to the NetApp asking for a data stream to be sent to the backup target, the Data Domain. The NetApp then rehydrates (expands) the 8TB back to 16TB and sends that stream to the Data Domain. The Data Domain will then dedupe that data back down to probably 4TB (since Data Domain dedupe is more sophisticated).
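To make the arithmetic concrete, here is a minimal sketch of the sizes involved, using the illustrative ratios from the example above (these numbers are assumptions, not measurements):

```python
# Back-of-the-envelope model of the NetApp -> Data Domain flow above.
# The ratios are illustrative, taken from the example in the text.

logical_tb = 16.0              # original, un-deduped volume
netapp_ratio = 2.0             # NetApp dedupe: 16 TB logical -> 8 TB on disk
data_domain_ratio = 4.0        # Data Domain: 16 TB of backup stream -> 4 TB stored

stored_on_primary = logical_tb / netapp_ratio        # 8 TB on the filer
sent_over_network = logical_tb                       # NDMP rehydrates, so the full 16 TB moves
stored_on_target = logical_tb / data_domain_ratio    # 4 TB on the Data Domain

print(f"on primary:  {stored_on_primary:.0f} TB")
print(f"on the wire: {sent_over_network:.0f} TB (rehydrated)")
print(f"on target:   {stored_on_target:.0f} TB")
```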

This is wasteful and has several negative consequences. It uses a bunch of CPU and I/O on the NetApp to rehydrate the data, which means performance might be slower for users while this is going on. You have to use the full network bandwidth to move the whole 16TB to the Data Domain. And you have to buy a Data Domain model big enough to handle 16TB of backup data instead of 8TB.


One thing that does not happen, contrary to what Steve seems to think, is that you need 16TB of free disk space on the NetApp to hold all that expanded data before it goes off to backup. The rehydration happens in the NDMP stream as it is sent, so no staging space is needed on the filer.

The situation is ugly, but it is not that ugly.

Now, how would this work if you used Ocarina dedupe and compression instead of NetApp's? Let's look at the following scenarios, and since NetApp has its own dedupe, let's use a NAS filer that has nice integration with Ocarina, the BlueArc:

Scenario 1: Compression-only, BlueArc to Data Domain
Scenario 2: Dedupe and Compression, BlueArc to Data Domain, Full Backup
Scenario 3: Dedupe and Compression, BlueArc to Data Domain, Incrementals

To continue in the vein of the NetApp example, let's say you started with 16TB. Ocarina will have shrunk that to maybe 4TB in the compression-only case, and maybe 2TB in the compression-and-dedupe case – because we shrink better than anyone. In the first case, Ocarina replaces each file with a compressed version of the file in the same volume. When you go to back up, you back up through a mount point that exports the volume without going through the decompression layer.

So this works just like it would with Storwize – the compressed versions of files go to the Data Domain. When you back up day after day, you will create duplicates, and the Data Domain will find and eliminate those.
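As a rough sketch of that idea (the function names and storage layout here are invented for illustration, not Ocarina's actual interfaces), the user-facing read path decompresses on the fly while the backup export simply hands out the stored, still-compressed bytes:

```python
# Conceptual sketch, not Ocarina's actual API: files are stored compressed,
# users read through a decompressing layer, and the backup mount point
# returns the stored bytes as-is.
import zlib

store = {}  # path -> compressed bytes, standing in for the optimized volume

def write(path, data):
    store[path] = zlib.compress(data)       # files are kept compressed at rest

def read_for_user(path):
    return zlib.decompress(store[path])     # user/application mount: transparent

def read_for_backup(path):
    return store[path]                      # backup mount: raw compressed file

write("/vol/projects/report.doc", b"quarterly numbers " * 1000)
print(len(read_for_user("/vol/projects/report.doc")))    # full logical size
print(len(read_for_backup("/vol/projects/report.doc")))  # far smaller; this is what goes to the backup target
```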

It should be noted, though, that Data Domain results will be slightly worse with either Storwize or Ocarina compression – because compressing data makes it harder to find duplicates.
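A quick way to see why: take two files that differ by a single byte and compare their fixed-size blocks before and after compression. The raw files share almost every block; the compressed files typically share almost none. This is a generic zlib illustration, not a statement about how Data Domain or Storwize actually chunk data.

```python
# Generic illustration with zlib: a one-byte change leaves most raw blocks
# identical, but the compressed representations diverge almost completely.
import random
import zlib

def blocks(data, size=4096):
    """Cut data into fixed-size blocks, as a simple block-level dedupe might."""
    return {data[i:i + size] for i in range(0, len(data), size)}

random.seed(0)
base = bytes(random.choices(b"abcdefgh ", k=200_000))   # compressible-ish sample data
edited = b"X" + base[1:]                                 # change a single byte near the front

raw_shared = len(blocks(base) & blocks(edited))
compressed_shared = len(blocks(zlib.compress(base)) & blocks(zlib.compress(edited)))

print(f"shared raw blocks:        {raw_shared}")         # nearly all blocks still match
print(f"shared compressed blocks: {compressed_shared}")  # typically few or none match
```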

In the second scenario, we have not only compressed the data, but deduped it too. This makes backups more complicated, because the pieces and parts of a file may be spread around in many other files. Doing a full volume backup is pretty straightforward, though. You first take a snapshot of the volume, and then back it up. It is important to take a snapshot, because you need to make sure all the pieces of files are consistent at a point in time. You back up the 2TB to the Data Domain. Now, does this mean you don't need a Data Domain, because dedupe was already done at the source?

No, not at all. The first time you back up a volume, Data Domain won't shrink it much, if at all, because there shouldn't be any duplicates in the data set. Ocarina dedupe is at least as good as Data Domain's. However, backup is not something you do once. It's something you do every day. So when you back up the next day, and the day after that, and so forth, the Data Domain will find plenty of duplicates, as you back up the same files over and over. Over the course of a month or so, the Data Domain will be getting its 20:1 dedupe ratio even though the source volume was perfectly deduped already!
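A back-of-the-envelope sketch of that effect, assuming (purely for illustration) a 2TB deduped source volume, daily full backups, and about 1% new data per day, lands in the same neighborhood as that 20:1 figure:

```python
# Back-of-the-envelope model: repeated full backups of an already-deduped
# volume. The 2 TB size and 1% daily change rate are illustrative assumptions.

volume_tb = 2.0          # deduped, compressed source volume
daily_change = 0.01      # fraction of the volume that is new data each day
days = 30                # roughly a month of daily full backups

logical = 0.0            # what the backup software thinks it wrote
stored = volume_tb       # what the target actually keeps (first full has no duplicates)
for day in range(days):
    logical += volume_tb
    if day > 0:
        stored += volume_tb * daily_change   # only the changed data is new to the target

print(f"logical backup data: {logical:.0f} TB")              # 60 TB
print(f"stored on target:    {stored:.2f} TB")                # about 2.6 TB
print(f"effective dedupe ratio: {logical / stored:.0f}:1")    # roughly 23:1
```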

The third scenario is the most complicated. Doing incremental backups from a NAS usually means backup software and NDMP. The backup server, called a DMA (Data Management Application), figures out which files have changed since the last backup, and then sends a request to the NDMP Data Server. Normally, NDMP is a service provided by the NAS head, but when you have used Ocarina to dedupe a volume, Ocarina will provide the NDMP Data Server. The NDMP data request will come to the Ocarina NDMP service and ask for the 1,237 files that have changed since yesterday. Now, you could just rehydrate those files and send them to the backup target. However, Ocarina has a dedupe-aware NDMP. This will figure out which chunks are needed to rehydrate the 1,237 files requested by the backup DMA, and will create – on the fly, using no disk space – a self-contained NDMP data stream that is deduped within itself.

You might see some partial rehydration, because the backup stream needs to have every block in it necessary to recover the files being backed up. But there will be no duplicate blocks in the data stream that goes to the backup target. What's more, all those blocks or chunks will remain compressed. So what shows up at the Data Domain is a file-level incremental backup that is both deduped and compressed. This allows file-level restores by the backup software DMA (like NetBackup or Commvault).
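Here is a conceptual sketch of what a stream that is "deduped within itself" could look like. The chunking, hashing, and manifest format are invented for illustration and are not Ocarina's actual NDMP implementation; the point is that each unique chunk appears once, while each requested file keeps a manifest of chunk references, which is what makes single-file restores possible.

```python
# Conceptual sketch only: chunking and format are invented, not Ocarina's
# real NDMP stream. Each unique chunk is emitted once; each file carries a
# manifest of chunk hashes so it can be restored individually.
import hashlib

def build_stream(changed_files):
    """changed_files maps a path to its ordered list of (compressed) chunks."""
    chunk_store = {}   # hash -> chunk bytes, every chunk appears only once
    manifests = {}     # path -> ordered list of chunk hashes
    for path, chunks in changed_files.items():
        refs = []
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            chunk_store.setdefault(digest, chunk)   # keep the first copy, skip duplicates
            refs.append(digest)
        manifests[path] = refs
    return chunk_store, manifests

files = {
    "/vol/a/slides.ppt": [b"chunk-A", b"chunk-B", b"chunk-C"],
    "/vol/a/copy.ppt":   [b"chunk-A", b"chunk-B", b"chunk-D"],  # shares two chunks
}
chunks, manifests = build_stream(files)
print(len(chunks), "unique chunks for", len(manifests), "files")  # 4 unique chunks for 2 files
```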

OK, now let’s take one more example, because this is the way you’d do data reduction as an enterprise strategy, rather than as a point solution for one filer. We call it end-to-end dedupe.

Scenario 4: Dedupe and Compression, BlueArc to BlueArc

In this case, we're going to back up from one Ocarina-enabled BlueArc to another. The backup software will see the second BlueArc as an NFS backup target, just as it does a Data Domain. In either the full or the incremental case, the NDMP service will call the Ocarina NDMP Data Server on the primary BlueArc. But when the backup starts, Ocarina will query the target on a known port to ask if it is Ocarina-aware. If the answer is yes, then instead of sending the data, Ocarina's dedupe-aware NDMP will send just the hashes.

The target-side BlueArc (acting in place of the Data Domain) will examine those hashes and determine if it has any of that data already. It uses a negative acknowledgement protocol to tell the source BlueArc which chunks it needs to execute the backup. On the first backup, this will be all of the chunks or objects. But on subsequent backups, the dedupe in both places will be synched up, and only net new data is moved.
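A minimal sketch of that exchange, with invented function names (the real protocol details are not public): the source offers hashes, the target answers with the hashes it lacks, and only those chunks cross the wire.

```python
# Sketch of the hash exchange above, with invented names: the source offers
# chunk hashes, the target replies with the ones it does not have (a
# negative acknowledgement), and only those chunks are transferred.
import hashlib

def chunk_hash(data):
    return hashlib.sha256(data).hexdigest()

def backup_round(source_chunks, target_store):
    """Both arguments map chunk hash -> compressed chunk bytes."""
    offered = list(source_chunks)                              # 1. send hashes only
    missing = [h for h in offered if h not in target_store]    # 2. target NAKs what it lacks
    for h in missing:                                          # 3. move only net-new chunks
        target_store[h] = source_chunks[h]
    return len(missing)

source = {chunk_hash(c): c for c in (b"chunk-1", b"chunk-2", b"chunk-3")}
target = {}
print(backup_round(source, target))   # first backup: all 3 chunks cross the wire
print(backup_round(source, target))   # next backup, nothing changed: 0 chunks move
```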

Now, in this case, you have true end-to-end dedupe. You dedupe and compress the primary storage. When it comes time to back up, you do not need any disk to store rehydrated data. Rather, you engage in an intelligent conversation with the backup target and move only the blocks, chunks, or objects needed to complete the backup, keeping every chunk in its compressed form. You can do this from any Ocarina-enabled source to any Ocarina-enabled target.

And it applies to more than just backups. This kind of optimized end-to-end approach works for replication, tiering (primary to nearline, for example), migration, archiving (primary to object store), and backup alike.

A significant percentage of the I/Os in a data center are done not for users or applications, but in support of storage management workflows. Ocarina can be deployed as an enterprise storage optimization solution – not a point solution for one NAS filer, not even a solution for one data center. Ocarina can be deployed across multiple tiers of storage – NAS, block, DAS, archive, object, cloud and backup – and then all storage management workflows will operate on dedupe-aware, compressed data. The benefits are not just saved disk space (and power and cooling and so forth), but network bandwidth, time, backup window reduction, and more.

If you are a storage vendor reading this, you should know that Ocarina now has a complete storage integration SDK. This is a set of APIs and documentation for integrating Ocarina – in-band or post-process – inside different types of storage. There is a framework for file system-based products, a framework for block products (DAS and storage arrays), and a framework for cloud and object stores (get/post/put model). If you want consistent, compatible, and integrated data reduction across your whole storage product line, give us a jingle.
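Purely as a hypothetical illustration of the kind of surface such a framework might expose (these names are invented and do not describe the actual SDK), an object-store integration in the get/post/put model could look something like this:

```python
# Hypothetical sketch only: these class and method names are invented and do
# not describe the actual Ocarina SDK.
from abc import ABC, abstractmethod

class ObjectStoreAdapter(ABC):
    """Minimal surface a data-reduction engine would need from a
    get/post/put-style object store."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Fetch a stored (optimized) object."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store an optimized object or chunk."""

    @abstractmethod
    def exists(self, key: str) -> bool:
        """Let the engine skip uploads of chunks the store already holds."""
```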

