Friday, January 27, 2012

Difference between multiplexing and multistreaming

Multiplexing sends data from multiple sources to a single tape or disk device. This is useful if you have a tape or disk device that writes faster than a single system can send data, which (at this point) is just about every tape device. 

Multiplexing does require tuning considerations with regard to restoring the data. Generally speaking, the higher the multiplexing setting, the greater the impact on the performance of an individual stream restored from within the multiplexed set. (If you're recovering all of the streams that were multiplexed together, MPX should have minimal impact on restore performance.) If NetBackup multiplexed several backups together and you're only restoring one of them, it has to read all of the backups and discard what it doesn't need. This reduces the effective throughput of the drive, and if that throughput drops below the bandwidth available for the restore, it will slow your restore down.
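
As a rough illustration of this effect, the sketch below estimates the restore rate of a single stream pulled out of a multiplexed set by assuming the drive must read and discard the data belonging to the other streams. This is a simplified model, not NetBackup's actual behaviour, and the drive speed, MPX level, and link bandwidth are hypothetical numbers.

    # Rough sketch: effective restore throughput for one stream out of an MPX set.
    # All numbers are hypothetical; real behaviour depends on drive, media, and tuning.

    drive_read_mb_s = 160.0   # native streaming read speed of the tape drive
    mpx_level = 4             # number of backups interleaved on the tape
    restore_link_mb_s = 60.0  # bandwidth available to the restore client

    # Only 1/mpx_level of what the drive reads belongs to the stream being restored;
    # the rest is read and thrown away.
    effective_mb_s = drive_read_mb_s / mpx_level

    # The restore runs at whichever is slower: the effective tape rate or the link.
    restore_rate = min(effective_mb_s, restore_link_mb_s)

    print(f"Effective single-stream rate from tape: {effective_mb_s:.0f} MB/s")
    print(f"Actual restore rate: {restore_rate:.0f} MB/s")
    # With MPX=4 the effective rate (40 MB/s) falls below the 60 MB/s link,
    # so multiplexing, not the network, becomes the restore bottleneck.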

Multistreaming establishes multiple connections, or threads, from a single system to the backup server. This is useful if you have a large system with multiple I/O devices and large amounts of data that need backing up.
Multistreaming also requires planning for implementation. To run effectively, it needs to obtain all required resources at the same time so it can deliver the shortest possible backup window. It also needs to be tested against the source storage to find the optimal number of streams. Adding streams increases backup speed until the source storage cannot supply the data any faster; beyond that point, backup times start to increase because the storage is at 100% utilization.
The good news about multistreaming is that it can also shorten restore times for the same data. In general, the destination storage is the bottleneck when writing restored data. With proper tuning of NUMBER_DATA_BUFFERS and SIZE_DATA_BUFFERS, combined with fast tape and disk drives, restore times in a multistreaming configuration can approach a 1:1 ratio with backup times. This is a very good thing with respect to disaster recovery / business continuity.
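
NUMBER_DATA_BUFFERS and SIZE_DATA_BUFFERS are the tuning values mentioned above; the usual sizing concern is how much shared memory a given combination consumes per drive and per multiplexed stream. The sketch below only does that arithmetic, using the commonly quoted formula of buffers x buffer size x drives x MPX; the values shown are examples, not recommendations.

    # Sketch: shared memory consumed by NetBackup data buffers on a media server.
    # Example values only, not tuning recommendations.

    number_data_buffers = 256      # contents of the NUMBER_DATA_BUFFERS touch file
    size_data_buffers = 262144     # contents of SIZE_DATA_BUFFERS (256 KB)
    tape_drives = 4                # concurrently active drives
    mpx_per_drive = 4              # multiplexing level per drive

    shared_memory_bytes = (number_data_buffers * size_data_buffers
                           * tape_drives * mpx_per_drive)

    print(f"Shared memory required: {shared_memory_bytes / 2**20:.0f} MB")
    # 256 buffers x 256 KB x 4 drives x MPX 4 = 1024 MB of shared memory,
    # which has to fit comfortably in the media server's RAM.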

Thursday, January 26, 2012

Checkpoint restart

Checkpoint restart is a facility offered by some database management systems (DBMSs) and backup-restore software. Checkpoints are taken in anticipation of the potential need to restart a software process.
Many ordinary long-running batch processes are time-consuming, as are backup and restore operations. They consist of many units of work. If checkpointing is enabled, checkpoints are initiated at specified intervals, expressed in units of work or in processing time. At each checkpoint, intermediate results and a log recording the process's progress are saved to non-volatile storage. The contents of the program's memory area may also be saved.

The purpose of checkpointing is to minimize the amount of time and effort wasted when a long software process is interrupted by a hardware failure, a software failure, or resource unavailability. With checkpointing, the process can be restarted from the latest checkpoint rather than from the beginning.
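
The mechanism can be pictured with a minimal sketch (generic logic, not NetBackup's checkpoint implementation): a long job records its progress to non-volatile storage after every batch of work units, and on restart it resumes from the last recorded checkpoint instead of from the beginning. The file name, batch size, and work function below are hypothetical.

    import json
    import os

    CHECKPOINT_FILE = "job.checkpoint"   # hypothetical checkpoint location
    TOTAL_UNITS = 1_000                  # units of work in the long-running job
    CHECKPOINT_EVERY = 100               # checkpoint interval, in units of work

    def load_checkpoint():
        """Return the last completed unit, or 0 if no checkpoint exists."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return json.load(f)["last_completed"]
        return 0

    def save_checkpoint(unit):
        """Persist progress so a restart can skip already-completed work."""
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump({"last_completed": unit}, f)

    def process_unit(unit):
        pass  # stand-in for one unit of real work (e.g. backing up one file set)

    start = load_checkpoint()                # 0 on a fresh run, >0 after a restart
    for unit in range(start + 1, TOTAL_UNITS + 1):
        process_unit(unit)
        if unit % CHECKPOINT_EVERY == 0:
            save_checkpoint(unit)            # intermediate results hit non-volatile storage

    if os.path.exists(CHECKPOINT_FILE):
        os.remove(CHECKPOINT_FILE)           # job finished; checkpoint no longer needed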

Checkpoint Frequency

Checkpoints should occur frequently enough to minimize wasted effort when a restart is necessary but not so frequently as to prolong the process unduly with checkpoint overhead. Optimal checkpoint frequency depends on the mean time between failures (MTBF), among other factors.
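
One commonly cited rule of thumb for this trade-off is Young's approximation, which derives a checkpoint interval from the checkpoint overhead and the MTBF. The figures below are purely illustrative.

    import math

    # Young's approximation for the optimal checkpoint interval:
    #   interval ~ sqrt(2 * checkpoint_cost * MTBF)

    checkpoint_cost_s = 30.0      # time spent writing one checkpoint
    mtbf_s = 8 * 3600.0           # mean time between failures: 8 hours

    optimal_interval_s = math.sqrt(2 * checkpoint_cost_s * mtbf_s)
    print(f"Checkpoint roughly every {optimal_interval_s / 60:.0f} minutes")
    # sqrt(2 * 30 * 28800) ~ 1315 s, i.e. about every 22 minutes.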




Disk Staging

Disk staging uses disk as an additional, temporary stage in the backup process before the backup is finally written to tape. Backups typically stay on disk for a day to a week before being copied to tape in a background process and then deleted.
The disk staging process is controlled by the same software that performs the actual backups, which distinguishes it from a virtual tape library, where the intermediate disk usage is hidden from the main backup software. Both techniques are known as D2D2T (disk-to-disk-to-tape).

Data is restored from disk if possible; if the data exists only on tape, it is restored directly from tape (there is no backward staging on restore).
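
The staging lifecycle described above can be summarized in a short sketch (generic logic under assumed names, not any particular product's implementation): back up to the disk staging area first, relocate to tape in the background, free the disk copy afterwards, and prefer the disk copy on restore.

    # Sketch of a D2D2T (disk-to-disk-to-tape) staging workflow.
    # Generic logic only; the stores and names here are hypothetical.

    disk_stage = {}   # backup_id -> data, the fast temporary tier
    tape = {}         # backup_id -> data, the final destination

    def backup(backup_id, data):
        """Stage 1: write the backup to disk so the client finishes quickly."""
        disk_stage[backup_id] = data

    def relocate_to_tape():
        """Background stage 2: copy staged backups to tape, then free the disk copy."""
        for backup_id, data in list(disk_stage.items()):
            tape[backup_id] = data
            del disk_stage[backup_id]     # in practice, removed once retention expires

    def restore(backup_id):
        """Restore from disk if the copy is still staged; otherwise read the tape copy."""
        if backup_id in disk_stage:
            return disk_stage[backup_id]
        return tape[backup_id]

    backup("clientA_full_0127", b"...")
    relocate_to_tape()
    print(restore("clientA_full_0127") == b"...")   # True, served from tape here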

Reasons behind using D2D2T:
  • increase performance of small, random-access restores: disk has much faster random access than tape
  • increase overall backup/restore performance: although a disk and a tape drive have similar streaming throughput, disk throughput is easy to scale by striping (tape striping is a much less established technique)
  • increase utilization of tape drives: tape shoe-shining effect is eliminated when staging (note that it may still happen on tape restores)

Synthetic Backups

A synthetic backup is identical to a regular full backup in terms of data, but it is created by combining a previous, older full backup with the subsequent incremental backups, which contain only the changed data. The backup application merges these two kinds of images to create the synthetic full backup. A synthetic backup is used when time or system constraints do not allow a complete full backup to be taken directly. Benefits include a shorter backup window and reduced restore times and costs, since a restore reads a single full image instead of a full plus a chain of incrementals. The procedure is called "synthetic" because the backup is not created directly from the original data.
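
The assembly step can be modelled as a simple catalog merge (a simplified illustration, not NetBackup's actual image format): start from the older full backup and apply each incremental in order, so the newest version of every file wins and deleted files drop out.

    # Simplified model of building a synthetic full backup from an older full
    # backup plus a chain of incrementals. A "backup" here is just a dict of
    # path -> file contents; real products work on backup images, not dicts.

    full_backup = {"/etc/hosts": "v1", "/data/a.db": "v1", "/tmp/old.log": "v1"}

    incrementals = [
        {"changed": {"/data/a.db": "v2"}, "deleted": ["/tmp/old.log"]},
        {"changed": {"/etc/hosts": "v2", "/data/new.csv": "v1"}, "deleted": []},
    ]

    def synthesize(full, incs):
        """Merge a full backup with its incrementals into a synthetic full."""
        synthetic = dict(full)                 # start from the older full image
        for inc in incs:                       # apply incrementals oldest -> newest
            synthetic.update(inc["changed"])   # newer versions replace older ones
            for path in inc["deleted"]:
                synthetic.pop(path, None)      # files deleted since the full drop out
        return synthetic

    print(synthesize(full_backup, incrementals))
    # {'/etc/hosts': 'v2', '/data/a.db': 'v2', '/data/new.csv': 'v1'}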

Data Encryption

Data Encryption Standard (DES) is a widely used method of data encryption using a private (secret) key that the U.S. government judged so difficult to break that it restricted its export to other countries. There are more than 72,000,000,000,000,000 (72 quadrillion) possible encryption keys. For each message, the key is chosen at random from this enormous number of keys. As with other private-key cryptographic methods, both the sender and the receiver must know and use the same private key.

DES applies a 56-bit key to each 64-bit block of data. The process can run in several
modes and involves 16 rounds of operations. Although this was long considered "strong" encryption, many companies use "triple DES", which applies three keys in succession. That is not to say a DES-encrypted message cannot be "broken." Early in 1997, RSA Data Security, owner of another encryption approach, offered a $10,000 reward for breaking a DES-encrypted message. A cooperative Internet effort of over 14,000 computer users trying out various keys finally deciphered the message, discovering the key after running through only 18 quadrillion of the 72 quadrillion possible keys. Few messages sent today with DES encryption are likely to face this kind of code-breaking effort.
DES originated at IBM in the early 1970s and was adopted as a U.S. federal standard in 1977. It is specified in the ANSI X3.92 and X3.106 standards and in Federal standards FIPS 46 and FIPS 81. Concerned that the encryption algorithm could be used by unfriendly governments, the U.S. government restricted export of the encryption software. However, free versions of the software are widely available on bulletin board services and Web sites. Because of concern that the encryption algorithm will not remain unbreakable, NIST indicated that DES would not be recertified as a standard and accepted submissions for its replacement, which became the Advanced Encryption Standard (AES).
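
The "72 quadrillion" figure quoted above is simply the size of DES's 56-bit key space; the quick check below reproduces it and shows roughly how much of that space the 1997 challenge had to search.

    # Size of the DES key space: a 56-bit key gives 2**56 possible keys.
    des_keys = 2 ** 56
    print(f"{des_keys:,}")   # 72,057,594,037,927,936 -> "72 quadrillion"

    # The 1997 challenge found the key after searching about a quarter of the space.
    print(f"{18 * 10**15 / des_keys:.0%} of the key space searched")   # ~25%

    # Triple DES with three independent keys uses 3 * 56 = 168 key bits
    # (its effective strength is lower because of meet-in-the-middle attacks).
    print(f"{2 ** 168:.2e} possible three-key triple DES keys")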

Saturday, January 14, 2012

When to use Authorization Manager (AzMan)

Authorization Manager (AzMan) enables you to define individual operations, which can be grouped together to form tasks. You can then authorize roles to perform specific tasks and/or individual operations. AzMan provides an administration tool as a Microsoft Management Console (MMC) snap-in to manage roles, tasks, operations, and users. You can configure an AzMan policy store in an XML file, Active Directory, or in an Active Directory Application Mode (ADAM) store.
ASP.NET version 2.0 role management provides an API that enables you to manage application roles and users' membership of roles. By configuring the ASP.NET role manager to use the AuthorizationStoreRoleProvider, you can use the role management API against an AzMan policy store.
The AuthorizationStoreRoleProvider does not support AzMan business rules ("BizRules"), which are scripted extensions to authorization checks, because the current role manager implementation has no concept of extended data that can be passed along during an authorization check. To use AzMan BizRules, you need to use COM interop.

Sunday, January 8, 2012

Data deduplication Drawbacks and concerns

Whenever data is transformed, concerns arise about potential loss of data. By definition, data deduplication systems store data differently from how it was written. As a result, users are concerned with the integrity of their data. The various methods of deduplicating data all employ slightly different techniques; however, the integrity of the data ultimately depends on the design of the deduplication system and the quality of the implementation of its algorithms. As the technology has matured over the past decade, the integrity of most of the major products has been well proven.
One method for deduplicating data relies on the use of cryptographic hash functions to identify duplicate segments of data. If two different pieces of information generate the same hash value, this is known as a collision. The probability of a collision depends upon the hash function used, and although the probabilities are small, they are always non-zero. Thus, the concern arises that data corruption can occur if a hash collision occurs and additional means of verification are not used to check whether there is a difference in the data. The hash functions used include standards such as SHA-1, SHA-256, and others. These provide a far lower probability of data loss than the risk of an undetected and uncorrected hardware error in most cases, on the order of 10^-49% per petabyte (1,000 terabytes) of data. Both in-line and post-process architectures may offer bit-for-bit validation of original data for guaranteed data integrity.
Some cite the computational resource intensity of the process as a drawback of data deduplication. This is rarely an issue for stand-alone devices or appliances, as the computation is completely offloaded from other systems, but it can be an issue when deduplication is embedded within devices providing other services. To improve performance, many systems utilize both weak and strong hashes. Weak hashes are much faster to calculate but carry a greater risk of hash collision. Systems that utilize weak hashes subsequently calculate a strong hash and use it as the determining factor in whether the data is actually the same. Note that the system overhead associated with calculating and looking up hash values is primarily a function of the deduplication workflow. The reconstitution of files does not require this processing, and any incremental performance penalty associated with re-assembly of data chunks is unlikely to impact application performance.
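
The weak-hash/strong-hash combination described above can be sketched as follows. This is an illustrative index, not any vendor's implementation: a cheap hash narrows the candidates, a cryptographic hash confirms the match, and an optional byte-for-byte comparison removes collision risk entirely.

    import hashlib
    import zlib

    # chunk_index: weak hash -> list of (strong hash, stored chunk)
    chunk_index = {}

    def store_chunk(chunk: bytes, verify_bytes: bool = True) -> str:
        """Store a chunk if unseen; return its strong hash as the reference."""
        weak = zlib.adler32(chunk)                  # cheap filter, collisions possible
        strong = hashlib.sha256(chunk).hexdigest()  # collision probability negligible

        for stored_strong, stored_chunk in chunk_index.get(weak, []):
            if stored_strong == strong:
                # Optional paranoia: byte-for-byte check removes collision risk.
                if not verify_bytes or stored_chunk == chunk:
                    return stored_strong            # duplicate: keep only a reference

        chunk_index.setdefault(weak, []).append((strong, chunk))
        return strong

    # Two identical chunks are stored once; a different chunk gets its own entry.
    ref1 = store_chunk(b"A" * 8192)
    ref2 = store_chunk(b"A" * 8192)
    ref3 = store_chunk(b"B" * 8192)
    print(ref1 == ref2, ref1 == ref3)                   # True False
    print(sum(len(v) for v in chunk_index.values()))    # 2 unique chunks stored
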
Another area of concern with deduplication is the related effect on snapshots, backup, and archival, especially where deduplication is applied against primary storage (for example inside a NAS filer). Reading files out of a storage device causes full reconstitution of the files, so any secondary copy of the data set is likely to be larger than the primary copy. In terms of snapshots, if a file is snapshotted prior to deduplication, the post-deduplication snapshot will preserve the entire original file. This means that although storage capacity for primary file copies will shrink, capacity required for snapshots may expand dramatically.
Another concern is the effect of compression and encryption. Although deduplication is a version of compression, it works in tension with traditional compression. Deduplication achieves better efficiency against smaller data chunks, whereas compression achieves better efficiency against larger chunks. The goal of encryption is to eliminate any discernible patterns in the data. Thus encrypted data cannot be deduplicated, even though the underlying data may be redundant. Deduplication ultimately reduces redundancy. If this was not expected and planned for, this may ruin the underlying reliability of the system. (Compare this, for example, to the LOCKSS storage architecture that achieves reliability through multiple copies of data.)
Scaling has also been a challenge for deduplication systems, because ideally the scope of deduplication should be shared across storage devices. If there are multiple disk backup devices in an infrastructure, each with discrete deduplication, space efficiency is adversely affected. Deduplication shared across devices preserves space efficiency, but is technically challenging from a reliability and performance perspective.
Security concerns exist with deduplication. In some systems, an attacker can retrieve data owned by others by knowing or guessing the hash value of the desired data.

Data Deduplication Benefits

  • Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk, a surprisingly common scenario. In the case of data backups, which are routinely performed to protect against data loss, most of the data in a given backup is unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed or by storing the differences between files. Neither approach captures all redundancies, however. Hard linking does not help with large files that have changed only in small ways, such as an email database; storing differences only finds redundancies between adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents).
  • Network data deduplication is used to reduce the absolute number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information.
  • Virtual servers benefit from deduplication because it allows nominally separate system files for each virtual server to be coalesced into a single storage space. At the same time, if a given server customizes a file, deduplication will not change the files on the other servers—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved.

Data deduplication

In computing, data deduplication is a specialized data compression technique for eliminating coarse-grained redundant data. The technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent across a link. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and, whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency depends in part on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.

This type of deduplication is different from that performed by standard file compression tools such as LZ77 and LZ78. Whereas those tools identify short repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect large volumes of data and identify large sections, such as entire files or large sections of files, that are identical, in order to store only one copy of each. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same one-megabyte (MB) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; subsequent instances are referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.
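
The 100-to-1 figure in the attachment example follows directly from storing the 1 MB attachment once and keeping 100 small references to it. The sketch below just does that bookkeeping with hypothetical data (the tiny size of the references themselves is ignored).

    import hashlib

    store = {}        # sha256 digest -> stored data (one copy per unique attachment)
    references = []   # what each backed-up instance keeps instead of the data

    attachment = b"x" * (1024 * 1024)        # the same 1 MB attachment, 100 times
    for _ in range(100):
        digest = hashlib.sha256(attachment).hexdigest()
        if digest not in store:
            store[digest] = attachment       # first instance: store the data
        references.append(digest)            # every instance: keep only a reference

    logical_bytes = 100 * len(attachment)                 # what a plain backup writes
    physical_bytes = sum(len(v) for v in store.values())  # what dedup actually stores
    print(f"deduplication ratio ~ {logical_bytes / physical_bytes:.0f}:1")   # ~100:1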

Major features of NetBackup

  • Data Deduplication
    • Client or server-side deduplication via integration with the PureDisk data deduplication engine
  • Security
    • Data Encryption
    • Access Control
  • Performance
    • Synthetic Backups
    • Disk Staging
    • Checkpoint restart
    • Multiplexed Backup
    • Multi-streamed Backup
    • Inline Copy
    • Online NetBackup catalog backup
  • Management and Reporting
    • Web-based management reporting (VERITAS NetBackup Operations Manager)
    • Tape volume, drive and library viewing
    • Error message identification, categorization and troubleshooting
  • Media Management
    • Enterprise Media Manager
    • Automatic robotic/tape drive configuration
    • Broad tape device support
  • Heterogeneous Support
    • Broad platform support
    • Bare-metal restore
    • Support for leading networking topologies
    • Advanced software and hardware snapshot support
    • NetBackup RealTime

History of NetBackup

  • In 1987, Chrysler Corporation engaged Control Data Corporation to write a backup software solution. A small group of engineers (Rick Barrer, Rosemary Bayer, Paul Tuckfield and Craig Wilson) wrote the software. Other Control Data customers later adopted it for their own needs.
  • In 1990, Control Data formed the Automated Workstation Backup System business unit. The first version of AWBUS supported two tape drives in a single robotic carousel with the SGI IRIX operating system.
  • In 1993, Control Data renamed the product to BackupPlus 1.0 (this is why many NetBackup commands have a 'bp' prefix). Software improvements included support for media Volume Management and Server Migration/Hierarchical Storage Management.
  • In late 1993, OpenVision acquired the product and Control Data's 12-person Storage Management team. This is why, on UNIX platforms, NetBackup installs into /usr/openv. During this time, OpenVision renamed BackupPlus to NetBackup.
  • On May 6, 1997, Veritas acquired OpenVision, absorbing the NetBackup product line.
  • In 2005 Symantec acquired Veritas and NetBackup became a Symantec product. Also at that time, Symantec released NetBackup 6.0, the 30th version of the software.
