Sunday, January 8, 2012

Data Deduplication Drawbacks and Concerns

Whenever data is transformed, concerns arise about potential loss of data. By definition, data deduplication systems store data differently from how it was written. As a result, users are concerned with the integrity of their data. The various methods of deduplicating data all employ slightly different techniques. However, the integrity of the data ultimately depends upon the design of the deduplicating system and the quality of the algorithms' implementation. As the technology has matured over the past decade, the integrity of most of the major products has been well proven.
One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends upon the hash function used, and although the probabilities are small, they are always non-zero. Thus, the concern arises that data corruption can occur if a hash collision occurs and additional means of verification are not used to check whether there is a difference in the data or not. The hash functions used include standards such as SHA-1, SHA-256 and others. These provide a far lower probability of data loss than the risk of an undetected and uncorrected hardware error in most cases, and the probability can be on the order of 10^-49% per petabyte (1,000 terabytes) of data. Both in-line and post-process architectures may offer bit-for-bit validation of original data for guaranteed data integrity.
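As a rough illustration of how small these probabilities are, the sketch below (Python) computes the birthday-bound collision probability for a 256-bit hash across the chunks in one petabyte of data; the 8 KiB chunk size is an assumption made only for this example, and the exact figure depends on the chunk size and hash function used.

    # Birthday-bound estimate of the chance of any hash collision in a
    # content-addressed deduplication index. Chunk size and hash width are
    # illustrative assumptions, not figures from any particular product.
    from decimal import Decimal, getcontext

    getcontext().prec = 80   # enough precision for the very small numbers involved

    def collision_probability(num_chunks: int, hash_bits: int) -> Decimal:
        """Approximate P(at least one collision) <= n*(n-1) / (2 * 2**hash_bits)."""
        n = Decimal(num_chunks)
        space = Decimal(2) ** hash_bits
        return n * (n - 1) / (2 * space)

    chunks_per_pb = (10 ** 15) // (8 * 1024)   # one petabyte split into 8 KiB chunks
    p = collision_probability(chunks_per_pb, 256)
    print(f"chunks per petabyte: {chunks_per_pb}")
    print(f"approx. collision probability with a 256-bit hash: {p:.3e}")
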
Some cite the computational resource intensity of the process as a drawback of data deduplication. However, this is rarely an issue for stand-alone devices or appliances, as the computation is completely offloaded from other systems. It can be an issue when the deduplication is embedded within devices providing other services. To improve performance, many systems use both weak and strong hashes. Weak hashes are much faster to calculate, but carry a greater risk of hash collision. Systems that use weak hashes subsequently calculate a strong hash and use it as the determining factor in whether two chunks are actually the same data or not. Note that the system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow. The reconstitution of files does not require this processing, and any incremental performance penalty associated with re-assembly of data chunks is unlikely to impact application performance.
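A minimal sketch of that two-tier lookup, assuming zlib.adler32 as the weak hash, SHA-256 as the strong hash, and a plain in-memory index (all of these are illustrative choices, not a specific product's design):

    import hashlib
    import zlib

    index = {}   # weak hash -> list of stored chunks sharing that weak hash

    def weak_hash(chunk: bytes) -> int:
        return zlib.adler32(chunk)                # cheap to compute, collision-prone

    def strong_hash(chunk: bytes) -> str:
        return hashlib.sha256(chunk).hexdigest()  # expensive, collisions negligible

    def is_duplicate(chunk: bytes) -> bool:
        """Store the chunk if unseen and report whether it was a duplicate."""
        w = weak_hash(chunk)
        candidates = index.get(w)
        if candidates is None:
            index[w] = [chunk]                    # no weak match: SHA-256 is skipped entirely
            return False
        # Weak match found: the strong hash decides whether the data is really the same.
        s = strong_hash(chunk)
        for stored in candidates:
            if strong_hash(stored) == s:          # a real system would cache these digests
                return True
        candidates.append(chunk)
        return False

    print(is_duplicate(b"chunk A"))   # False: first occurrence is stored
    print(is_duplicate(b"chunk A"))   # True: confirmed by the strong hash
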
Another area of concern with deduplication is the related effect on snapshots, backup, and archival, especially where deduplication is applied against primary storage (for example inside a NAS filer). Reading files out of a storage device causes full reconstitution of the files, so any secondary copy of the data set is likely to be larger than the primary copy. In terms of snapshots, if a file is snapshotted prior to deduplication, the post-deduplication snapshot will preserve the entire original file. This means that although storage capacity for primary file copies will shrink, capacity required for snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although deduplication is a form of compression, it works in tension with traditional compression: deduplication achieves better efficiency against smaller data chunks, whereas compression achieves better efficiency against larger chunks. The goal of encryption is to eliminate any discernible patterns in the data, so encrypted data cannot be deduplicated, even though the underlying data may be redundant. Deduplication ultimately reduces redundancy; if this was not expected and planned for, it may undermine the underlying reliability of the system. (Compare this, for example, to the LOCKSS storage architecture, which achieves reliability through multiple copies of data.)
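The toy example below shows why encryption defeats deduplication: three identical plaintext copies share a single fingerprint, while the same copies "encrypted" with per-copy randomness no longer match. The XOR-with-random-bytes cipher is only a stand-in for a real cipher used with a per-copy key or IV; it is not a secure construction.

    import hashlib
    import os

    def fingerprint(chunk: bytes) -> str:
        return hashlib.sha256(chunk).hexdigest()

    def toy_encrypt(chunk: bytes) -> bytes:
        keystream = os.urandom(len(chunk))        # fresh randomness per copy
        return bytes(a ^ b for a, b in zip(chunk, keystream))

    chunk = b"the same one-megabyte attachment" * 10

    plain_copies = [chunk, chunk, chunk]
    encrypted_copies = [toy_encrypt(chunk) for _ in range(3)]

    print(len({fingerprint(c) for c in plain_copies}))       # 1 -> all copies deduplicate
    print(len({fingerprint(c) for c in encrypted_copies}))   # 3 -> no duplicates detected
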
Scaling has also been a challenge for deduplication systems because, ideally, the scope of deduplication needs to be shared across storage devices. If there are multiple disk backup devices in an infrastructure with discrete deduplication, then space efficiency is adversely affected. Deduplication shared across devices preserves space efficiency, but is technically challenging from a reliability and performance perspective.
Security concerns exist with deduplication. In some systems, an attacker can retrieve data owned by others by knowing or guessing the hash value of the desired data.

Data Deduplication Benefits

  • Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk, which is a surprisingly common scenario. In the case of data backups, which are routinely performed to protect against data loss, most of the data in a given backup is unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed, or by storing differences between files. Neither approach captures all redundancies, however. Hard linking does not help with large files that have only changed in small ways, such as an email database, and storing differences only finds redundancies between adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents). Chunk-level deduplication catches such cases, as the sketch after this list illustrates.
  • Network data deduplication is used to reduce the absolute number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information.
  • Virtual servers benefit from deduplication because it allows nominally separate system files for each virtual server to be coalesced into a single storage space. At the same time, if a given server customizes a file, deduplication will not change the files on the other servers—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved.
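The sketch below, which assumes fixed 4 KiB chunking, illustrates the backup case from the first bullet: two versions of a large file differ by a small in-place change, so hard linking (which only detects identical files) saves nothing, while chunk-level deduplication stores the shared data only once. Real products often use variable, content-defined chunk boundaries instead of a fixed size.

    import hashlib
    import random

    CHUNK = 4096   # fixed chunk size, an assumption made for illustration

    def unique_bytes(versions) -> int:
        """Total bytes kept when every distinct chunk is stored only once."""
        store = {}
        for data in versions:
            for i in range(0, len(data), CHUNK):
                chunk = data[i:i + CHUNK]
                store.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
        return sum(store.values())

    rng = random.Random(0)
    v1 = bytes(rng.getrandbits(8) for _ in range(1_000_000))   # a large, mostly static file
    v2 = bytearray(v1)
    v2[500_000:500_100] = b"X" * 100                           # small in-place modification
    v2 = bytes(v2)

    print("two full copies:        ", len(v1) + len(v2))       # hard links save nothing here
    print("after chunk-level dedup:", unique_bytes([v1, v2]))  # only one changed chunk is added
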

Data deduplication

In computing, data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent across a link. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copies and, whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency depends in part on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
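A minimal sketch of that workflow, assuming fixed-size chunks and SHA-256 digests as the stored references (real systems often use variable, content-defined chunking and their own reference formats):

    import hashlib

    CHUNK_SIZE = 8192
    chunk_store = {}                                # digest -> chunk bytes, stored only once

    def dedup_write(data: bytes) -> list:
        """Deduplicate data, returning the list of references that represents it."""
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            chunk_store.setdefault(digest, chunk)   # store the chunk only if it is new
            refs.append(digest)                     # a small reference replaces each repeat
        return refs

    def dedup_read(refs: list) -> bytes:
        """Reconstitute the original data from its references."""
        return b"".join(chunk_store[d] for d in refs)

    block = bytes(range(256)) * 32                  # one 8 KiB chunk of sample data
    payload = block * 128                           # the same bytes repeated 128 times (1 MiB)
    refs = dedup_write(payload)
    assert dedup_read(refs) == payload
    stored = sum(len(c) for c in chunk_store.values())
    print(f"{len(payload)} bytes written, {stored} bytes stored")
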

This type of deduplication is different from that performed by standard file-compression tools such as LZ77 and LZ78. Whereas these tools identify short repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, in order to store only one copy. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same one-megabyte (MB) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.

Major features of NetBackup

  • Data Deduplication
    • Client or server-side deduplication via integration with the PureDisk data deduplication engine
  • Security
    • Data Encryption
    • Access Control
  • Performance
    • Synthetic Backups
    • Disk Staging
    • Checkpoint restart
    • Multiplexed Backup
    • Multi-streamed Backup
    • Inline Copy
    • Online NetBackup catalog backup
  • Management and Reporting
    • Web-based management reporting (VERITAS NetBackup Operations Manager)
    • Tape volume, drive and library viewing
    • Error message identification, categorization and troubleshooting
  • Media Management
    • Enterprise Media Manager
    • Automatic robotic/tape drive configuration
    • Broad tape device support
  • Heterogeneous Support
    • Broad platform support
    • Bare-metal restore
    • Support for leading networking topologies
    • Advanced software and hardware snapshot support
    • NetBackup RealTime

History of NetBackup

  • In 1987, Chrysler Corporation engaged Control Data Corporation to write a backup software solution. A small group of engineers (Rick Barrer, Rosemary Bayer, Paul Tuckfield and Craig Wilson) wrote the software. Other Control Data customers later adopted it for their own needs.
  • In 1990, Control Data formed the Automated Workstation Backup System business unit. The first version of AWBUS supported two tape drives in a single robotic carousel with the SGI IRIX operating system.
  • In 1993, Control Data renamed the product to BackupPlus 1.0 (this is why many NetBackup commands have a 'bp' prefix). Software improvements included support for media Volume Management and Server Migration/Hierarchical Storage Management.
  • In late 1993, OpenVision acquired the product and Control Data's 12-person Storage Management team. This is why, on UNIX platforms, NetBackup installs into /usr/openv. During this time, OpenVision renamed BackupPlus to NetBackup.
  • On May 6, 1997, Veritas acquired OpenVision, absorbing the NetBackup product line.
  • In 2005 Symantec acquired Veritas and NetBackup became a Symantec product. Also at that time, Symantec released NetBackup 6.0, the 30th version of the software.
