Diagnosing and Resolving ANR8779E With Error 16 or 170
Technote (troubleshooting)
Problem(Abstract)
Diagnosing and resolving ANR8779E with error number 16 / 170 (EBUSY) reported by the
Tivoli Storage Manager Server during drive open operations.
Symptom
ANR8779E Unable to open drive /dev/rmtX, error number = 16
Cause
What message ANR8779E with error number 16/170 (EBUSY) on OPEN means:
This is a return code returned to Tivoli Storage Manager by the operating system via the device
driver (IBMtape or TSMtape) when attempting to open a device special file. On Unix operating
systems, errno 16 means that the file was busy, or EBUSY. On Windows operating systems, the
errno is 170, but the meaning is the same. This error indicates that the operating system could not
satisfy Tivoli Storage Manager's request to open the device, because it was in use somewhere
else. When dealing with special device files, this means that the device has an open reservation
on it through a different HBA, which typically means on a separate host (but could be the same
host if multiple HBAs are in use).
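The mapping described above can be sketched in a few lines of Python. This is an illustrative helper, not part of Tivoli Storage Manager: the function name and the regular expression are assumptions, and the pattern is based only on the message text shown under Symptom.

```python
import re

# Busy codes named in this technote: 16 (EBUSY on Unix) and 170 (Windows).
BUSY_CODES = {16: "EBUSY (Unix)", 170: "device busy (Windows)"}

# Pattern modeled on the ANR8779E text shown under "Symptom" (an assumption).
ANR8779E = re.compile(
    r"ANR8779E Unable to open drive (?P<device>\S+), "
    r"error number\s*=\s*(?P<errno>\d+)"
)


def classify_open_failure(line):
    """Return (device, errno, is_busy) for an ANR8779E line, else None."""
    match = ANR8779E.search(line)
    if match is None:
        return None
    code = int(match.group("errno"))
    return match.group("device"), code, code in BUSY_CODES


print(classify_open_failure(
    "ANR8779E Unable to open drive /dev/rmt3, error number = 16"))
```

An ANR8779E line with any other error number is returned with is_busy set to False, which signals that the reservation guidance in this document may not apply.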
The Library Manager is responsible for drive allocation and assignments. Once a drive is
allocated, the owning host (Library Client/STA) ensures exclusive ownership by "reserving" the
drive, and the Library Manager maintains the drive inventory and owners.
The following is an example of the design of a single drive assignment in a library sharing
environment:
Generally, an EBUSY error is returned any time a reservation is held by a host that should not
hold it, or is behaving outside of the intended design of a library sharing environment.
Known Tivoli Storage Manager error number 16/170 (EBUSY) defect causes:
Under somewhat rare conditions, a Tivoli Storage Manager library manager can become overloaded
with drive "in-use" confirmation heartbeats from library clients and/or storage agents. This
behavior opens a window in which the library manager may not be able to process the heartbeats
in a timely manner. Since the library manager is not able to process the heartbeats, it believes the
drives are no longer in use and will attempt to reclaim them for use by other processes. An
indicator of this problem may be the following warning:
ANR8925W Drive <drive name> in library <library name> has not been confirmed for use by
server <server name> for over <number> seconds. Drive will be reclaimed for use by others.
If the drive in question is not being used by the library client or storage agent, then this warning
would be expected and normal as there may have been an un-correctable problem with the
hardware or software involved and the library manager should reclaim the drive. If the drive in
question is actively being used by the library client or storage agent, this warning message and
the actions it triggers are incorrect.
When the library manager attempts to reclaim the drive, it may log EBUSY (ANR8779E) errors for
that drive because the library client still holds the reservation. Eventually, the library
manager should be able to free the drive and the EBUSY errors will stop.
The premature drive reclamation problem and potential solutions are documented by the
following APARs:
IC54647 - This is a short term solution for V5 servers. A new option, LIBSHRTIMEOUT, was
introduced to help mitigate the issue.
IC55068 - This is the long term solution available only in V6.
IC63637 - This is an update to the long term solution.
If the library manager cannot be upgraded to a V6 level to get the long term fix, the short term
fix is to increase the heartbeat timeout using the library manager server option
"LIBSHRTIMEOUT." The default is 15 minutes, and the maximum is 60 minutes. If increasing
the timeout to 60 minutes does not resolve the problem (which is possible in complex and busy
environments), the only other option in V5 is to reduce the amount of drive/library activity
occurring.
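The effect of the timeout can be illustrated with a toy model. This is not Tivoli Storage Manager code: the function, the drive names, and the timings are hypothetical, and the only values taken from this document are the 15-minute default and the 60-minute maximum.

```python
def drives_to_reclaim(last_confirmed, now, timeout_minutes):
    """Return drives whose last processed confirmation is older than the timeout.

    last_confirmed maps drive name -> time (in minutes) of the last heartbeat
    the library manager actually processed, not the last one that was sent.
    """
    return sorted(
        drive for drive, seen in last_confirmed.items()
        if now - seen > timeout_minutes
    )


# Hypothetical backlog: DRIVE1 is in use, but its heartbeats have been waiting
# unprocessed for 20 minutes. DRIVE2 was confirmed 5 minutes ago.
processed = {"DRIVE1": 100, "DRIVE2": 115}
now = 120

print(drives_to_reclaim(processed, now, timeout_minutes=15))  # default timeout
print(drives_to_reclaim(processed, now, timeout_minutes=60))  # maximum timeout
```

With the default timeout the model reclaims DRIVE1 even though it is still in use. With the maximum timeout the backlog has time to clear, which is why raising the value only helps while the processing delay stays under 60 minutes.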
Known external error number 16/170 (EBUSY) causes:
Any IBM-manufactured HBAs should have the following microcode/firmware levels applied
(depending on the model number):
df1000fd-0002.271304
df1000fd-0002.271310
df1000fe-0002.271315
SAN status/health monitoring utilities such as SanSurfer and HBAExplorer have been known to
place reserves on devices. There have been several reported defects in older versions of these
utilities that can place reserves on devices.
It has also been reported that HP DDMI (Discovery and Dependency Mapping Inventory) may
place reserves on drives during scans. This utility is part of the HP OpenView suite. There is no
known fix for this behavior at this time, so this application should be removed or the collection
for SAN attached devices should be disabled.
Recommendation: Upgrade any SAN monitoring utilities to current levels to avoid known
defects that can place reserves. Alternatively, cease using these utilities, especially when the
library environment is active or in-use.
3. Tapeutil/ITDT usage and/or scripting
Utilities to interact with tape devices such as tapeutil and ITDT can place reserves on devices.
Some customers implement scripts to monitor drive/library status using tapeutil/ITDT. If
any administrator or script is using such a utility to gather information on library/drive status, it
can place reserves on drives.
4. Tapeutil/ITDT defect
Tapeutil/ITDT contains a defect in which the application can leave a reserve on a device even if
it was properly closed before exiting the application. As such, any other application attempting to
reserve a device that tapeutil/ITDT still holds an orphaned lock against will receive a device
busy error.
Recommendation: Tapeutil has been superseded by ITDT (IBM Tape Diagnostic Tool).
Upgrade to and use only the latest available version of ITDT to avoid known defects.
ITDT can be downloaded from Fix Central.
Environments using 64-bit storage agents may be exposed to a timing condition within the
implementation and use of the Windows 64-bit HBA API. This timing condition can cause
device busy errors. The Tivoli Storage Manager storage agent code for Windows storage agents
has been altered to mitigate this problem by introducing retry logic. This work was introduced
via APAR IC61104, included in the following Tivoli Storage Manager levels: 5.4.5.1, 5.4.6,
5.5.3, 6.1.2 and 6.2.0.
Recommendation: Upgrade all Tivoli Storage Manager servers and storage agents to the level
5.4.5.1, 5.4.6, 5.5.3, 6.1.2, 6.2.0, or greater.
There have been numerous defects in the IBM tape (IBMtape/Atape/lin_tape) device driver code
that can cause reservation conflicts.
Review each drive's configuration using the "lsattr -El /dev/rmtxx" command on AIX platforms
(where xx is the number of the drive). The "retain_reserve" setting returned should be set to
"no". If it is set to "yes", the drive can retain the reservation, which can cause reservation
conflicts when the drive is given to another host for lanfree activity. The "chdev" command can
be used to change the attributes. Please contact the system administrator and/or AIX support if
assistance is required.
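When many drives need to be checked, saved lsattr output can be screened with a short script. The sketch below assumes the usual column layout of attribute name followed by value. The sample text is illustrative rather than captured from a real system, and the function names are hypothetical.

```python
def parse_lsattr(output):
    """Map attribute name -> value from `lsattr -El` style output.

    Assumes the first two whitespace-separated fields on each line are the
    attribute name and its value.
    """
    attrs = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            attrs[fields[0]] = fields[1]
    return attrs


def retains_reserve(output):
    """True if retain_reserve is yes, False if no, None if the attribute is absent."""
    value = parse_lsattr(output).get("retain_reserve")
    if value is None:
        return None
    return value.lower() == "yes"


# Illustrative sample only; real output lists many more attributes.
SAMPLE = """\
block_size     0    Block Size (0=Variable Length)  True
retain_reserve yes  Retain Reservation              True
"""

if retains_reserve(SAMPLE):
    print("retain_reserve is yes: drive may hold its reservation across hosts")
```

A result of None means the attribute was not found in the text, so the output should be checked by hand rather than treated as a pass.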
Environment
All Tivoli Storage Manager environments using Lanfree or library sharing.
1. Review the Tivoli Storage Manager activity log, operating system logs (errpt, messages file,
event log), and SAN/device logs to completely define the problem:
* Are there any patterns? Do the device reservation conflicts always occur at the same time? Do
they always occur against the same devices (drives)?
* Do the errors happen on a library manager, or library client, or storage agent? Do they always
involve a specific host, drive, operation?
* Are there any library sharing communications failures?
* Are there any hardware or SAN errors around the time of the reservation conflict?
* Validate that all Tivoli Storage Manager library definitions are correct (QUERY LIBRARY
F=D).
* Validate that all Tivoli Storage Manager drive definitions are correct (QUERY DRIVE F=D).
* Validate that all Tivoli Storage Manager path definitions are correct (QUERY PATH F=D).
* Validate that all device WWNs, serial numbers, and device special files are correct for every
host that has Tivoli Storage Manager paths defined (Library Managers, Library Clients, and
Storage Agents).
Please note that the VALIDATE LANFREE command can be leveraged to confirm working
storage agent communications.
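The pattern questions in step 1 (same time of day, same drives) can be answered by tallying the ANR8779E entries in an exported activity log. The sketch below assumes each record sits on one line and begins with an MM/DD/YYYY HH:MM:SS timestamp. That layout, the sample records, and the function name are assumptions, so adjust the pattern to match the actual export.

```python
import re
from collections import Counter

# Assumed record layout: "MM/DD/YYYY HH:MM:SS ANR8779E Unable to open drive ..."
RECORD = re.compile(
    r"^\d{2}/\d{2}/\d{4} (?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"ANR8779E Unable to open drive (?P<device>\S+), "
    r"error number\s*=\s*(?P<errno>\d+)"
)


def tally_busy_opens(lines):
    """Count busy (16/170) open failures per hour of day and per device."""
    by_hour = Counter()
    by_device = Counter()
    for line in lines:
        match = RECORD.match(line)
        if match and match.group("errno") in ("16", "170"):
            by_hour[match.group("hour")] += 1
            by_device[match.group("device")] += 1
    return by_hour, by_device


# Hypothetical records for illustration.
LOG = [
    "01/15/2013 02:01:10 ANR8779E Unable to open drive /dev/rmt1, error number = 16",
    "01/15/2013 02:14:55 ANR8779E Unable to open drive /dev/rmt1, error number = 16",
    "01/15/2013 09:30:00 ANR8779E Unable to open drive /dev/rmt4, error number = 5",
]

hours, devices = tally_busy_opens(LOG)
print(hours.most_common())
print(devices.most_common())
```

A tally concentrated in one hour suggests a scheduled job or scan, while a tally concentrated on one device points toward that drive's paths or a specific host holding it.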
3. Review ALL of the above "Known causes" and apply any fixes applicable to your
environment. In summary, the following must be completed:
* Upgrade all Tivoli Storage Manager library clients, library managers, and storage agents to
currently available code. Do not forget to upgrade the Tivoli Storage Manager device driver if it
is in use.
* Upgrade all IBM tape device drivers to current.
* Upgrade all ITDT implementations to current across the entire environment.
* Upgrade any IBM manufactured HBA microcode/firmware to current.
* Upgrade any SAN monitoring software to current.
5. Use ITDT to determine the host where the conflict is being held to isolate the problem.
If the above does not resolve the issue, collect the data in the following MustGather document
before contacting IBM Tivoli Storage Manager support:
Enhanced diagnostics including extensive code tracing and problem reproduction to further
isolate, identify, and/or rule out any type of software defect with the Tivoli Storage Manager
application may be requested.
Other recommendations/suggestions, best practices, and notes:
Best practices:
Persistent binding should be enabled at the HBA layer on all involved machines in the lanfree
environment. This reduces device and path churn and can reduce failures.
Consider enabling the RESETDRIVES parameter on the library definition, if possible. This
option can allow the device driver to attempt(!) to break a reservation. It is important to note that
if the Persistent Reservation option is enabled on the HBA, RESETDRIVES cannot send a LUN
reset to break a reservation.
More information can be located in the following TechNote:
http://www-01.ibm.com/support/docview.wss?uid=swg21249613
Enable SANDISCOVERY to rediscover devices that have disappeared from the SAN. This can
often self-correct pathing issues.
http://www-01.ibm.com/support/docview.wss?uid=swg21257281
Significantly increasing the value may reduce reservation conflicts if the SAN is not healthy, and
devices are regularly disappearing and re-appearing on the SAN.
* If a drive is stuck in a reserved state, power cycling it at the physical hardware level can often
free the reservation, even if the holder is not known. This can temporarily provide relief for the
situation.
* Consider implementing fencing within the SAN zoning and/or via LUN masking to prevent
hosts that don't need to access the devices within the environment from accessing them. This can
help reduce the number of potential offending hosts.
Other Notes:
Users may also experience a perceived EBUSY error during other drive operations besides the
open. This includes during drive write operations. In such a case, the following error might be
present:
ANR8311E An I/O error occurred while accessing drive DRIVE1 (/dev/rmt1) for WRITE
operation, errno = 16.
While the errno may be consistent with an EBUSY reservation conflict, the root cause is most
likely NOT a reservation conflict. Most typically this indicates that the drive or SAN
switch is too busy to complete the WRITE operation and has asked for the application to retry
the request later.
The Tivoli Storage Manager device driver will automatically retry the operation every 10
seconds up to 90 times before failing the write operation with the ANR8311E. This functionality
exists in all currently supported Tivoli Storage Manager device driver code. The IBMtape device
driver, however, did not implement similar retry logic until version 12.3.7.0 (released in
12/2011). At 12.3.7.0, 50ms retry logic was introduced up to 4 minutes or 480 times. Upgrading
the appropriate device driver to current may resolve the issue.
If this doesn't resolve the error, further investigation at the device driver, SAN (switch), and tape
device layers should be completed to determine if the BUSY signal being returned is appropriate
or not.
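The fixed-interval retry behavior described above can be sketched as follows. This is a generic illustration, not the device driver's implementation: the function name and the injectable sleep parameter are assumptions, and the only figures taken from this document are the 10-second interval and the limit of 90 retries.

```python
import errno
import time


def retry_while_busy(operation, interval_seconds, max_retries, sleep=time.sleep):
    """Call operation(); on EBUSY wait and retry, up to max_retries retries.

    Any other OSError, or EBUSY after the last retry, is raised to the caller.
    """
    retries = 0
    while True:
        try:
            return operation()
        except OSError as exc:
            if exc.errno != errno.EBUSY or retries >= max_retries:
                raise
            retries += 1
            sleep(interval_seconds)


# Demonstration with a fake write that is busy twice, then succeeds.
# The waits are recorded instead of slept so the example runs instantly.
attempts = []
waits = []


def fake_write():
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError(errno.EBUSY, "Device or resource busy")
    return "written"


print(retry_while_busy(fake_write, 10, 90, sleep=waits.append))
print(waits)
```

With a 10-second interval and 90 retries, the worst case is 900 seconds (15 minutes) of waiting before the write fails, so a single ANR8311E with errno 16 already represents a prolonged busy condition.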