Sunday, March 1, 2009

How to Evaluate Your Business Need for Data Deduplication

The volume of data generated by companies today is growing explosively. More powerful computing technology and the evolution to an information-based economy are causing companies to generate more data than ever before. To deal with this overwhelming data growth and related storage requirements, many companies are evaluating the use of data deduplication technology.

By simple definition, data deduplication technology is software that compares data in new backup streams to data that has already been stored to identify and remove duplicates. Today, deduplication has become an essential tool in helping data managers to control exponential data growth in the backup environment. However, the methods used to accomplish data deduplication vary widely, as do the levels of capacity optimization they can provide. For example, virtual tape libraries provide a level of performance and reliability that traditional physical tape systems cannot match. VTLs enable companies to back up data many times faster than tape, restore data quickly and eliminate a variety of time-consuming manual tasks. However, without data deduplication, the cost of disk is higher than that of tape, forcing companies to use disk space carefully by keeping online retention times short and moving data to tape archive as quickly as possible.

To truly understand data deduplication, it's vital to understand the differing approaches to data deduplication. There are two general ways that deduplication technologies operate: hash-based comparison and the ContentAware comparison method used in SEPATON DeltaStor deduplication software on an S2100-ES2 virtual tape library (VTL).

The hash-based approach runs incoming data through an algorithm that assigns a unique number (called a hash) to every chunk of data. It then compares the new hashes to those that have already been stored in a lookup table. If the new hash does not match, it stores the corresponding chunk of data and adds the new hash to the lookup table. If the new hash does match one in the lookup table, it does not write the corresponding data to disk but records the duplicate in the lookup table so that the data can be reconstituted for restores.
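
To make those steps concrete, here is a minimal Python sketch of hash-based chunk deduplication. It is purely illustrative: the fixed 64 KB chunk size, the SHA-256 digest and the in-memory dictionaries are assumptions chosen for readability, not how any particular product implements its chunking or index.

```python
# Illustrative sketch of hash-based deduplication; not any vendor's actual code.
import hashlib

CHUNK_SIZE = 64 * 1024   # assumed fixed chunk size; real systems often vary it
chunk_store = {}         # hash -> stored chunk bytes (stands in for disk)
lookup_table = {}        # hash -> number of times the chunk has been seen

def ingest(backup_stream: bytes) -> list:
    """Split a backup stream into chunks, store only new ones, and return
    the ordered list of hashes needed to reconstitute the stream."""
    recipe = []
    for offset in range(0, len(backup_stream), CHUNK_SIZE):
        chunk = backup_stream[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in lookup_table:
            chunk_store[digest] = chunk      # new data: write it once
            lookup_table[digest] = 1
        else:
            lookup_table[digest] += 1        # duplicate: record it, write nothing
        recipe.append(digest)
    return recipe

def restore(recipe: list) -> bytes:
    """Reassemble the original stream from its stored chunks."""
    return b"".join(chunk_store[digest] for digest in recipe)
```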

Meanwhile, the ContentAware approach reads the data that is in the backup and identifies commonalities and relationships between the objects/documents (e.g., Microsoft® Word document to Word document or Oracle® database to Oracle database) to narrow the search for duplicates. It then compares data in these objects at the byte level for maximum capacity reduction.
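
The fragment below only gestures at that idea in Python; it is not SEPATON's DeltaStor algorithm. It uses the standard-library SequenceMatcher to keep just the byte ranges of a new object that differ from a similar, earlier object of the same type, and the sample "documents" are invented.

```python
# Rough illustration of object-aware, byte-level comparison; NOT DeltaStor itself.
from difflib import SequenceMatcher

def byte_level_delta(reference: bytes, new_version: bytes) -> list:
    """Keep only the byte ranges of new_version that differ from reference."""
    matcher = SequenceMatcher(None, reference, new_version, autojunk=False)
    return [(start, new_version[start:end])
            for tag, _, _, start, end in matcher.get_opcodes()
            if tag != "equal"]

# Two near-identical "Word documents": only the few changed bytes are kept,
# rather than storing the second file in full.
old_doc = b"Quarterly report: revenue grew 4% in Q1."
new_doc = b"Quarterly report: revenue grew 6% in Q2."
print(byte_level_delta(old_doc, new_doc))
```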

As described above, hash-based technologies start by breaking data into chunks and assigning each chunk a number called a hash. New data is stored, and duplicate data is simply recorded in a "use count" tally. Each new backup gets broken up into more pieces that have to be identified, compiled and reassembled to restore. As a result, the more data stored on the system, the more pieces you generate. In contrast, the ContentAware approach uses the most recent (newest) backup as the reference data set. It compares data stored previously to this reference data set to identify duplicates.
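
A sketch of that reference-set idea, reusing the byte_level_delta helper from the previous example: the newest backup is kept whole, and the prior backup is reduced to its differing byte ranges. This is a simplified, hypothetical model; a real system would also record the matching ranges so that older backups remain fully restorable.

```python
# Hypothetical model of using the newest backup as the intact reference set.
repository = {"reference": b"", "older_deltas": []}

def store_new_backup(new_backup: bytes) -> None:
    """Make the incoming backup the intact reference; the prior reference is
    kept only as the byte ranges where it differs from the new backup."""
    previous = repository["reference"]
    if previous:
        repository["older_deltas"].append(byte_level_delta(new_backup, previous))
    repository["reference"] = new_backup   # latest data always stays whole

def restore_latest() -> bytes:
    """The most recent backup is returned as-is; no pieces to reassemble."""
    return repository["reference"]
```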

Another distinction between deduplication technologies is whether they deduplicate a given backup set inline, as part of the backup process, or concurrently with it. Inline deduplication aligns well with hash-based comparison technologies and provides a cost-effective way for small to medium-sized organizations to reduce their data center capacity needs. The concurrent method begins the deduplication process as the first backup job completes. It has several distinct advantages for larger backup volumes. With it, the VTL can load-balance the backup and deduplication processes across multiple nodes, enabling it to complete both processes faster than an inline system. It also stores the most recent backup in its intact form, enabling it to perform a data integrity check before any duplicate data is replaced with pointers.
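
The difference in ordering can be sketched as follows, again with hypothetical function names and reusing the ingest helper from the hash-based example above; this is a timing illustration, not a description of any product's internals.

```python
# Timing sketch: inline deduplication happens in the write path; concurrent
# deduplication starts after the backup lands intact and can be verified.
import hashlib

def verify_integrity(data: bytes, expected_digest: str) -> bool:
    """Second-level check made possible by keeping the intact copy."""
    return hashlib.sha256(data).hexdigest() == expected_digest

def inline_backup(chunks):
    """Inline: each incoming chunk is deduplicated before it reaches disk."""
    return [ingest(chunk) for chunk in chunks]

def concurrent_backup(chunks, expected_digest):
    """Concurrent: write the backup unmodified at full speed, check it against
    the expected digest, then deduplicate once the backup job has completed."""
    landing_area = b"".join(chunks)
    if verify_integrity(landing_area, expected_digest):
        return ingest(landing_area)
```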

Many deduplication technologies cannot scale backup performance or deduplication processing across multiple processing nodes. As a result, you have to add multiple individually managed boxes (see the discussion of capacity and performance scalability below), or tolerate significantly slower backup times. With the Scale-Out Deduplication capability of a SEPATON VTL with DeltaStor software, you can add capacity or performance to back up and deduplicate petabytes of data in a single system.

It's important to note that most deduplication technologies are "all or nothing," requiring you to deduplicate all of your backup data and to do so with the same algorithm. This method is adequate for small backup environments. However, in an enterprise, being able to fine-tune deduplication to your needs, data types and business objectives is essential. The efficiencies to be gained through deduplication depend on a number of factors, including (but not limited to) those below; a rough worked example follows the list:

- The amount of duplicate data in the backup stream
- The data application type (Exchange, Oracle, Word, etc.)
- The required online data retention period (longer retention times result in greater deduplication efficiency)
- The number of times per week that full backups are performed
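
As a worked example of how these factors interact, the arithmetic below uses invented numbers (one 10 TB weekly full backup, a 2% weekly change rate, 12 weeks of online retention); real deduplication ratios depend on your actual data and policies.

```python
# Back-of-the-envelope estimate with assumed, illustrative numbers only.
full_backup_tb  = 10      # size of one full backup
fulls_per_week  = 1       # weekly full backups
retention_weeks = 12      # required online retention period
weekly_change   = 0.02    # assumed fraction of data that changes each week

logical_tb  = full_backup_tb * fulls_per_week * retention_weeks
physical_tb = full_backup_tb + full_backup_tb * weekly_change * (retention_weeks - 1)

print(f"Logical data retained:  {logical_tb:.0f} TB")                # 120 TB
print(f"Deduplicated footprint: {physical_tb:.1f} TB")               # 12.2 TB
print(f"Approximate ratio:      {logical_tb / physical_tb:.1f}:1")   # ~9.8:1
```

With only four weeks of retention, the same assumptions yield roughly 3.8:1, which is why longer retention periods make deduplication more attractive.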

When considering a data deduplication solution for your business, be sure to evaluate how each potential solution meets your needs, which may include:

- Backup Performance and Time to Protection - Be sure to understand how a data deduplication technology will affect your backup and how quickly your data will be moved to the protection of a VTL. If you have full backups of more than 10 TB, you should consider an enterprise-optimized deduplication technology like DeltaStor software.
- Restore Performance - Choose a technology based on three key characteristics of your file restore needs: how often you need to restore files; the age of the files you typically restore (e.g., how often files are more than 30 days old); and how quickly you need to complete file restores. If restore time is a priority for you, choose a system that uses forward differencing to ensure that restores can be performed instantly, without "reconstitution".
- Deduplication Efficiency - It makes sense that the more duplicate data you have in your backup stream, the more beneficial a deduplication technology will be in your environment. Understand what level of deduplication efficiency is realistic in your environment and whether that is sufficient to offset your data growth.
- Risk to Data Integrity - Consider a solution that keeps an intact copy of your most recent backup and performs a second-level data integrity check.
- Capacity and Performance Scalability - Before choosing a technology, understand the implications of outgrowing your capacity and performance. Will adding capacity and performance mean maintaining numerous "silos of storage" or require a forklift upgrade to a new system?

For more information on choosing which technology is right for you, and more specifically on SEPATON's DeltaStor software, contact SEPATON.

By: Sepaton
