NBU TS Net 1


# bptestbpcd -host fred -connect_options 0 0 2

0 0 2
10.0.0.32:748 -> 10.0.0.59:13782
10.0.0.32:983 <- 10.0.0.59:635

A connection from the server fred to the client wilma is tried by using the vnetd port number if
possible:
# bptestbpcd -M fred -client wilma -connect_options 2 2 0
1 1 1
10.0.0.59:40983 -> 10.0.0.104:13724
10.0.0.59:40984 -> 10.0.0.104:13724
10.0.0.59:40985 -> 10.0.0.104:13724
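
To see which port each line of the output above refers to, the lines can be parsed mechanically. The following is a minimal Python sketch, not a NetBackup tool: the function name describe_connection is hypothetical, the port-to-service mapping covers only the two ports that appear in this text (13782 for bpcd, 13724 for vnetd), and reading "->" as outbound and "<-" as an inbound call-back is an assumption based on the examples.

```python
import re

# Ports seen in the examples in this text; anything else is reported as unknown.
KNOWN_PORTS = {13782: "bpcd", 13724: "vnetd"}

# Matches lines such as "10.0.0.32:748 -> 10.0.0.59:13782".
LINE_RE = re.compile(
    r"^(?P<lip>[\d.]+):(?P<lport>\d+)\s+(?P<arrow>->|<-)\s+"
    r"(?P<rip>[\d.]+):(?P<rport>\d+)$"
)


def describe_connection(line):
    """Return a dict describing one bptestbpcd connection line, or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    remote_port = int(m.group("rport"))
    return {
        "local": (m.group("lip"), int(m.group("lport"))),
        "remote": (m.group("rip"), remote_port),
        # Assumption: "->" is a connection we opened, "<-" is a call-back to us.
        "direction": "outbound" if m.group("arrow") == "->" else "inbound",
        "service": KNOWN_PORTS.get(remote_port, "unknown"),
    }


if __name__ == "__main__":
    for text in ("10.0.0.32:748 -> 10.0.0.59:13782",
                 "10.0.0.59:40983 -> 10.0.0.104:13724"):
        print(describe_connection(text))
```

In the second example every line targets port 13724, which is consistent with the test using vnetd rather than the legacy bpcd port.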

This command will make connections to client on the vnetd port instead of pbx?
And from next time backup will use vnetd port to connect to client?

The connect options apply only to the connection that bptestbpcd is about to perform.

If you want the connect option to be permanent you have to place the option in bp.conf on the master
and media servers.

The first setting indicates the type of port to use to connect to bpcd on the client:
0 = Use a reserved port number.

The second setting indicates the bpcd call-back method to use to connect to the client:
0 = Use the traditional call-back method

The third setting indicates the connection method to use to connect the client:
2 = Connect to a daemon on the server by using the traditional port number of the daemon only.
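
The three settings can be read off position by position. The sketch below is a small Python lookup that only knows the values quoted above (0, 0 and 2); any other value is reported as not described here rather than guessed. The names SETTINGS and explain_connect_options are hypothetical.

```python
# Each entry: (what the setting controls, {value: meaning quoted in this text}).
SETTINGS = [
    ("port type used to connect to bpcd",
     {0: "use a reserved port number"}),
    ("bpcd call-back method",
     {0: "use the traditional call-back method"}),
    ("connection method",
     {2: "connect to a daemon by using the traditional port number only"}),
]


def explain_connect_options(values):
    """Explain a -connect_options triple such as [0, 0, 2]."""
    if len(values) != 3:
        raise ValueError("expected exactly three settings")
    lines = []
    for (name, meanings), value in zip(SETTINGS, values):
        meaning = meanings.get(value, "not described in this text")
        lines.append(f"{name}: {value} = {meaning}")
    return lines


if __name__ == "__main__":
    for line in explain_connect_options([0, 0, 2]):
        print(line)
```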

Description
The information below is accurate for the specific version of NetBackup that is targeted. The details
relevant to each NetBackup version can be found in this article:
000017676
Best Practices for bptestnetconn including arguments and outputs by NetBackup version

The following command tests a connection from a NetBackup host (a master server, for example)
to another NetBackup host, such as a media server, and to a service that should be running on that
host. Here is an example command:
$ bptestnetconn -v -cnbrmms/DiskPollingService.DPS -t 10 -o 5 -H mymm

If there is a perceived problem with the master server polling disk pools on a media server, a test to
the nbrmms service can be done. This example shows SUCCESS, so the remote host is reachable and
the service is running.
$ bptestnetconn -v -cnbrmms/DiskPollingService.DPS -t 10 -o 5 -H mymm
adding hostname = mymm
------------------------------------------------------------------------
Connecting to 'nbrmms/DiskPollingService.DPS'
CN: mymm : 80 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
------------------------------------------------------------------------
Total elapsed time: 0 sec
The connection test will make a connection to PBX which is passed to the nbrmms service. Here is
what the behavior should look like in the logs when a successful connection is made:

From the PBX debug log on the media server:

12/07/2011 07:58:08.024 [Application] VxICS 50936 103 PID:2591 TID:47774493297840 File
ID:103 [No context] [Info] PBX_Manager:: handle_input with fd = 7
12/07/2011 07:58:08.025 [Application] VxICS 50936 103 PID:2591 TID:47774493297840 File
ID:103 [No context] [Info] PBX_Client_Proxy::parse_line, line = ack=1 From 10.10.6.101
12/07/2011 07:58:08.025 [Application] VxICS 50936 103 PID:2591 TID:47774493297840 File
ID:103 [No context] [Info] PBX_Client_Proxy::parse_line, line = extension=nbrmms From
10.10.6.101
12/07/2011 07:58:08.025 [Application] VxICS 50936 103 PID:2591 TID:47774493297840 File
ID:103 [No context] [Info] hand_off looking for proxy for = nbrmms
12/07/2011 07:58:08.025 [Application] VxICS 50936 103 PID:2591 TID:47774493297840 File
ID:103 [No context] [Info] is_accepting about to return true
12/07/2011 07:58:08.025 [Application] VxICS 50936 103 PID:2591 TID:47774493297840 File
ID:103 [No context] [Info] Client Expects an ACK, sent one.
12/07/2011 07:58:08.025 [Application] VxICS 50936 103 PID:2591 TID:47774493297840 File
ID:103 [No context] [Info] Proxy found.
12/07/2011 07:58:08.025 [Application] VxICS 50936 103 PID:2591 TID:47774493297840 File
ID:103 [No context] [Info] PBX_Client_Proxy::handle_close
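
When the PBX log is long, the interesting entries are the parse_line records that name the requested extension and the peer address. A minimal Python sketch for pulling those out follows. It assumes the entry layout shown above, including entries that wrap across lines; the function name requested_extensions is hypothetical.

```python
import re

# Matches "parse_line, line = extension=nbrmms From 10.10.6.101".
EXT_RE = re.compile(
    r"parse_line, line = extension=(\S+) From (\d{1,3}(?:\.\d{1,3}){3})"
)


def requested_extensions(pbx_log_text):
    """Return (extension, peer_ip) pairs found in PBX debug log text."""
    # Collapse all whitespace so that wrapped entries become one line.
    flat = " ".join(pbx_log_text.split())
    return EXT_RE.findall(flat)


if __name__ == "__main__":
    sample = (
        "12/07/2011 07:58:08.025 [Application] VxICS 50936 103 PID:2591 File\n"
        "ID:103 [No context] [Info] PBX_Client_Proxy::parse_line, "
        "line = extension=nbrmms From\n"
        "10.10.6.101\n"
    )
    print(requested_extensions(sample))
```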

Because the connection to PBX is passed to the nbrmms service, the nbrmms debug log will show the
connection arriving from the IP address of the master server.

12/07/2011 07:58:08.027 [Debug] NB 51216 libraries 137 PID:28355 TID:1123518784 File ID:222
[No context] 1 [vnet_cached_getaddrinfo_and_update] ../../libvlibs/vnet_addrinfo.c.1370:
1123518784: found in cache name: 10.10.6.101
12/07/2011 07:58:08.027 [Debug] NB 51216 libraries 137 PID:28355 TID:1123518784 File ID:222
[No context] 1 [vnet_cached_getaddrinfo_and_update] ../../libvlibs/vnet_addrinfo.c.1371:
1123518784: found in cache service: NULL
12/07/2011 07:58:08.027 [Debug] NB 51216 libraries 137 PID:28355 TID:1123518784 File ID:222
[No context] 1 [vnet_cached_getaddrinfo_and_update] ../../libvlibs/vnet_addrinfo.c.1514:
1123518784: found in file cache name: masterserver.name.local
12/07/2011 07:58:08.027 [Debug] NB 51216 libraries 137 PID:28355 TID:1123518784 File ID:222
[No context] 1 [vnet_cached_getaddrinfo_and_update] ../../libvlibs/vnet_addrinfo.c.1515:
1123518784: found in file cache service: NULL

The verbose (-v) argument to bptestnetconn also caused a connection to the bpcd process on the media
server. This is a single connection to prove bpcd is reachable; it does not attempt to bring up a
call-back connection. The bpcd debug log therefore shows the initial connection and then a failure when
bptestnetconn closes the initial socket without providing the information for the call-back connection.

07:58:08.020 [20273] <2> logconnections: BPCD ACCEPT FROM 10.10.6.101.33729 TO
10.10.6.102.13782 fd = 0
...snip...
07:58:08.028 [20273] <2> bpcd peer_hostname: Connection from host masterserver.name.local
(10.10.6.101) port 33729
07:58:08.028 [20273] <2> bpcd valid_server: comparing masterserver.name.local and
masterserver.name.local
07:58:08.028 [20273] <4> bpcd valid_server: hostname comparison succeeded
07:58:08.028 [20273] <2> vnet_cached_getaddrinfo_and_update: ../../libvlibs/vnet_addrinfo.c.1514:
0: found in file cache name: masterserver.name.local
07:58:08.028 [20273] <2> vnet_cached_getaddrinfo_and_update: ../../libvlibs/vnet_addrinfo.c.1515:
0: found in file cache service: NULL
07:58:08.028 [20273] <16> process_requests: read failed: No such file or directory

Similarly, the verbose (-v) argument also causes a connection to the vnetd service on the media server.
Once the connection is complete, bptestnetconn closes the connection without negotiating the vnetd
protocol or sending a vnetd command, as shown in the vnetd debug log on the media server.

07:58:08.015 [20272] <2> ProcessRequests: vnetd.c.587: 0: msg: VNETD ACCEPT FROM
10.10.6.101.41796 TO 10.10.6.102.13724 fd = 9
07:58:08.015 [20272] <2> vnet_pop_byte: ../../libvlibs/vnet.c.1159: 0: errno: 0 0x00000000
07:58:08.015 [20272] <2> vnet_pop_byte: ../../libvlibs/vnet.c.1161: 0: Function failed: 9 0x00000009
07:58:08.015 [20272] <2> vnet_pop_string: ../../libvlibs/vnet.c.1241: 0: Function failed: 9
0x00000009
07:58:08.015 [20272] <2> vnet_pop_signed: ../../libvlibs/vnet.c.1285: 0: Function failed: 9
0x00000009
07:58:08.015 [20272] <2> vnet_version_accept: vnetd.c.1023: 0: Function failed: 9 0x00000009
07:58:08.015 [20272] <2> ProcessRequests: vnetd.c.591: 0: version_accept failed: 9 0x00000009
07:58:08.015 [20272] <2> main: vnetd.c.519: 0: ProcessRequests returned: 9 0x00000009
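
Both the bpcd and the vnetd logs record the accepted connection in the same "ACCEPT FROM ... TO ..." form, with the port appended to the IP address after a final dot. The following Python sketch splits those endpoints. It assumes IPv4 addresses and the layout shown in the two excerpts above; parse_accept and split_endpoint are hypothetical names.

```python
import re

# Matches "BPCD ACCEPT FROM 10.10.6.101.33729 TO 10.10.6.102.13782 fd = 0".
ACCEPT_RE = re.compile(
    r"(?P<daemon>BPCD|VNETD) ACCEPT FROM (?P<src>[\d.]+) TO (?P<dst>[\d.]+)"
    r" fd = (?P<fd>\d+)"
)


def split_endpoint(endpoint):
    """Split "10.10.6.101.33729" into ("10.10.6.101", 33729)."""
    ip, _, port = endpoint.rpartition(".")
    return ip, int(port)


def parse_accept(log_text):
    """Return the first ACCEPT record in the text as a dict, or None."""
    # Collapse whitespace so that an entry wrapped over two lines still matches.
    flat = " ".join(log_text.split())
    m = ACCEPT_RE.search(flat)
    if m is None:
        return None
    return {
        "daemon": m.group("daemon").lower(),
        "source": split_endpoint(m.group("src")),
        "destination": split_endpoint(m.group("dst")),
    }


if __name__ == "__main__":
    entry = ("07:58:08.020 [20273] <2> logconnections: BPCD ACCEPT FROM "
             "10.10.6.101.33729 TO\n10.10.6.102.13782 fd = 0")
    print(parse_accept(entry))
```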

The bptestnetconn program can be used to test connectivity to any valid service/object that uses CORBA.
See the Related Articles for a list of some of the services and objects.

All configured NetBackup servers can be tested at one time by replacing the host (-H host) argument
with the server (-s) argument.

$ bptestnetconn -v -c -o 5 -t 10 -s
SERVER = mymaster
SERVER = myadmin
SERVER = myoldmm
MEDIA_SERVER = mymm
------------------------------------------------------------------------
Connecting to 'nbsl/HSFactory'
CN: mymaster : 11 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
CN: myadmin : 4 sec [TRANSIENT] PBX: Yes VNETD: Yes BPCD: Yes
CN: myoldmm : 4 sec [TRANSIENT] PBX: No VNETD: No BPCD: No
CN: mymm : 12 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
------------------------------------------------------------------------
Total elapsed time: 17 sec
In the above example, the orbtimeout (-t 10) and orbobjtimeout (-o 5) arguments were specified.
These limit the time permitted for the connection to the PBX and the service/object respectively.
When a connection fails, the length of the timeout will indicate which portion of the connection failed.

In the above test we see two hosts with SUCCESS and two hosts with TRANSIENT. The myadmin
host shows TRANSIENT because it is a Windows Administration Console host and does not run the
nbsl service. The myoldmm host is on the network, but both PBX and NetBackup are shut down;
ideally this host should be removed from the NetBackup configuration. The 4 second
timeout confirms that a network route exists, but that the service and/or object was not available.
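
With many configured servers it can help to filter the result lines automatically. The Python sketch below parses the "CN:" lines and lists the hosts that did not report SUCCESS. The line layout is inferred only from the two example outputs in this text and may differ between NetBackup versions; parse_results and hosts_needing_attention are hypothetical names.

```python
import re

# Matches "CN: mymm : 80 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes".
RESULT_RE = re.compile(
    r"^CN:\s+(?P<host>\S+)\s+:\s+(?P<elapsed>\d+)\s+(?P<unit>ms|sec)\s+"
    r"\[(?P<status>\w+)\]\s+PBX:\s+(?P<pbx>Yes|No)\s+"
    r"VNETD:\s+(?P<vnetd>Yes|No)\s+BPCD:\s+(?P<bpcd>Yes|No)\s*$"
)


def parse_results(output):
    """Turn bptestnetconn result lines into a list of dicts."""
    rows = []
    for line in output.splitlines():
        m = RESULT_RE.match(line.strip())
        if m is None:
            continue  # header, separator, or summary line
        elapsed = int(m.group("elapsed"))
        if m.group("unit") == "sec":
            elapsed *= 1000
        rows.append({
            "host": m.group("host"),
            "elapsed_ms": elapsed,
            "status": m.group("status"),
            "pbx": m.group("pbx") == "Yes",
            "vnetd": m.group("vnetd") == "Yes",
            "bpcd": m.group("bpcd") == "Yes",
        })
    return rows


def hosts_needing_attention(rows):
    """Return the hosts whose status is anything other than SUCCESS."""
    return [row["host"] for row in rows if row["status"] != "SUCCESS"]


if __name__ == "__main__":
    sample = """\
CN: mymaster : 11 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
CN: myadmin : 4 sec [TRANSIENT] PBX: Yes VNETD: Yes BPCD: Yes
CN: myoldmm : 4 sec [TRANSIENT] PBX: No VNETD: No BPCD: No
CN: mymm : 12 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
"""
    print(hosts_needing_attention(parse_results(sample)))
```

The PBX, VNETD and BPCD flags are kept per host because they separate a host that is reachable but lacks the service (myadmin) from one where nothing is listening (myoldmm).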

The customer takes a checkpoint every 7 minutes. One of his clients, bkoweb26, began to back up
successfully after an exclusions list was created. We also noticed a timeout of infinity in bpbrm (see
snippet). We have made suggestions and recommendations, listed below. Some were implemented, but not all.

a) Snippet of the backup log:


Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed: network connection
broken (40)
Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message to media manager: STOP
BACKUP bkoweb26_1585391368
Mar 28, 2020 5:53:17 AM - Info bpbrm (pid=7471230) media manager for backup id
bkoweb26_1585391368 exited with status 150: termination requested by administrator
Mar 28, 2020 5:53:17 AM - end writing; write time: 0:23:42
network connection broken (40)

b) Snippet of Media bpbrm


00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)
00:07:52.390 [18415728.1] <2> db_end: Need to collect reply
00:07:52.390 [18415728.1] <2> db_getdata: timeout is 0 (infinite)

Environment info:
Master: nmbackup01, configured on third-party, NBU version 8.1.1, Platform: AIX ver. 7.1
Media: nmbpmed05, configured on third-party server, NBU version 8.1.1, Platform AIX ver. 7.1
Clients: nmocmi02, bkoweb26 and bkoweb25

Recommendations:
1) Increase checkpoints to every 60min. (not done)
2) Decrease timeouts to 7200 instead of infinite (not done). I'm thinking these are
CLIENT_READ_TIMEOUT or CLIENT_CONNECT_TIMEOUT = 7200 in bp.conf
3) Since exclusions helped one client succeed, I wondered if the test in this technote could apply to
this situation: https://www.veritas.com/support/en_US/article.100003560
4) Check for communication related patches (tried but did not find any online)
5) Run bppllist and bpplinfo on media. Run bpgetconfig from a failing and a successful client, to
compare. (not done)

Use vxlogview -p 50936 -o 103, or check the /var/adm/syslog for the PBX logs written by the O/S.

Would also help to check the bprd logs on the master. You'll see the connection attempts to the
master, and you may see the cause of the network disruption.
Status 40 is generally related to another process terminating the connection, like a firewall or
something on the system interrupting the connection as opposed to something timing out.

We need to see ALL text in job details to show timestamps and PIDs

Can you please point out which logs exactly contain the specific job, timestamps and PID reflected
here:
5:53:15 AM - Info bpbrm (pid=7471230)
Important to look at one set of logs when you troubleshoot - see which client and which media server
exactly, what the timestamps and PIDs are, then follow the process flow in the relevant logs.

For example, it does not help to compare this activity monitor entry with the bpbrm log snippet that
happened at a different time and with different PIDs. Are these even the same media server?:

Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed: network connection
broken (40)
Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message to media manager: STOP
BACKUP bkoweb26_1585391368
00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)
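
A quick way to check whether a log excerpt belongs to the job in question is to compare PIDs. The Python sketch below extracts the PIDs from job detail lines and from legacy debug log lines and reports the overlap. It assumes that the bracketed number at the start of a legacy log entry is the PID, optionally followed by a dot and a thread number as in the AIX excerpt above; all function names are hypothetical.

```python
import re

# "(pid=7471230)" in Activity Monitor job details.
JOB_PID_RE = re.compile(r"\(pid=(\d+)\)")
# "00:07:52.377 [18415728.1] <2> ..." in a legacy debug log.
LEGACY_PID_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d{3} \[(\d+)(?:\.\d+)?\]", re.MULTILINE
)


def job_detail_pids(text):
    """PIDs mentioned in job detail text."""
    return {int(pid) for pid in JOB_PID_RE.findall(text)}


def legacy_log_pids(text):
    """PIDs that wrote entries in a legacy debug log excerpt."""
    return {int(pid) for pid in LEGACY_PID_RE.findall(text)}


def shared_pids(job_details, legacy_log):
    """PIDs present in both; an empty set suggests the excerpts do not match."""
    return job_detail_pids(job_details) & legacy_log_pids(legacy_log)


if __name__ == "__main__":
    details = (
        "Mar 28, 2020 5:53:14 AM - Error bpbrm (pid=10289380) db_FLISTsend failed\n"
        "Mar 28, 2020 5:53:15 AM - Info bpbrm (pid=7471230) sending message\n"
    )
    log = "00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)\n"
    print(shared_pids(details, log))
```

For the lines quoted above the overlap is empty, which supports the point that the bpbrm snippet comes from a different process than the failed job.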

Which timeouts exactly are configured as 0 (infinite) ?


IMHO, there is hardly ever any reason to increase Client Connect and Client Read timeouts to more
than 1800 (The default is 300, which is in most instances sufficient.)
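
To see which of these two timeouts a server actually sets, the bp.conf contents can be checked with a few lines of Python. This is a sketch under stated assumptions: entries use the "NAME = value" form, a missing entry means the default of 300 mentioned above, and 1800 is used as the ceiling only because it is the figure suggested in this thread. The sample contents and the name review_timeouts are hypothetical.

```python
DEFAULT_SECONDS = 300      # default quoted in this thread
SUGGESTED_CEILING = 1800   # upper bound suggested in this thread
TIMEOUT_KEYS = ("CLIENT_CONNECT_TIMEOUT", "CLIENT_READ_TIMEOUT")


def review_timeouts(bp_conf_text):
    """Return {name: (seconds, exceeds_ceiling)} for the two client timeouts."""
    configured = {}
    for raw in bp_conf_text.splitlines():
        key, sep, value = raw.partition("=")
        if sep and key.strip() in TIMEOUT_KEYS:
            configured[key.strip()] = int(value.strip())
    report = {}
    for key in TIMEOUT_KEYS:
        seconds = configured.get(key, DEFAULT_SECONDS)
        report[key] = (seconds, seconds > SUGGESTED_CEILING)
    return report


if __name__ == "__main__":
    # Hypothetical bp.conf contents for illustration only.
    sample = "SERVER = nmbackup01\nCLIENT_READ_TIMEOUT = 7200\n"
    for name, (seconds, too_high) in review_timeouts(sample).items():
        print(name, seconds, "above suggested ceiling" if too_high else "ok")
```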

About status 40:


The error says the connection is broken outside of NBU - therefore not a NBU timeout or failure.

Level 3 bpbrm and bptm logs on media server will confirm if data was received from client in a
continuous stream when network failure occurred.
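
One rough way to judge whether the stream was continuous is to look for the longest silence between consecutive log entries. The Python sketch below does that for legacy debug log text. It assumes every entry starts with an HH:MM:SS.mmm timestamp and that the excerpt does not cross midnight; the third sample line and the name largest_gap are hypothetical.

```python
import re
from datetime import timedelta

# Legacy debug log entries start with "HH:MM:SS.mmm ".
TS_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\.(\d{3}) ")


def largest_gap(log_text):
    """Return (gap, line_after_gap) for the longest pause between entries."""
    previous = None
    worst = (timedelta(0), None)
    for line in log_text.splitlines():
        m = TS_RE.match(line)
        if m is None:
            continue  # continuation of a wrapped entry
        hours, minutes, seconds, millis = map(int, m.groups())
        current = timedelta(hours=hours, minutes=minutes,
                            seconds=seconds, milliseconds=millis)
        if previous is not None and current - previous > worst[0]:
            worst = (current - previous, line)
        previous = current
    return worst


if __name__ == "__main__":
    sample = (
        "00:07:52.377 [18415728.1] <2> db_getdata: timeout is 0 (infinite)\n"
        "00:07:52.390 [18415728.1] <2> db_end: Need to collect reply\n"
        "00:09:52.390 [18415728.1] <2> hypothetical later entry\n"
    )
    gap, line = largest_gap(sample)
    print(gap, "before:", line)
```

A long gap just before the status 40 would be worth showing to the network team next to the firewall's idle timeout.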

Always best to involve network and firewall team in situations like these - they need to monitor the
port connections while the backup is running.

The issue you posted closely resembles the issue at hand because (I failed to mention) only FULL
backups are failing with this exit status. INCs are successfully backing up.

To answer the questions you posed in that discussion:


1) Master and Media are separate.
2) Unsure about a FW between the client, master and media, the question alone was taboo. However,
we received a bptestnetconn output and all is well. See attached.
3) I also believe this backup is failing due to the amount of data in a full vs. an inc. It makes sense that
the media is waiting for the set timeout but a FW may not be set to the same timeout and has already
dropped the connection.
4) I will run with your suggestions: activate accelerator and run a full, try a test dedup, check
switches/router/clients NIC if they are on full duplex and not half.

And here lies the frustration...I kept getting half the conversation no matter how many times I re-
requested logs. So the job detail I posted was the original job log. I can post the whole log, but it
seems moot since that bkoweb26 client is now backing up successfully.
"Which timeouts exactly are configured as 0 (infinite)?"

I have no idea, nor does my senior engineer. I only mentioned having it decreased to 7200 because it
appears all others on his server are set at that, aside from the infinite value, which I am guessing was
purposely performed.

Do you think checkpoints are an issue? If not, I will stop harping on him about that.

"The error says the connection is broken outside of NBU."

Could you point me to where you could deduce that? That is helpful to know.

The logs attached are the most recent bpbrm and bptm logs at a verbose level 3. I'd really love to
involve their networking team, but we received pushback, until we have concrete evidence it's
network or FW related.

I've been in that situation before; it is quite frustrating to constantly bear the burden of proof just to
receive logs.

Can you post the Detailed Status from the failing client?

Would also help to see the output from bptestbpcd -client <client name> -verbose -debug

This will give you a detailed output of the connection attempt through BPCD to the client.

What kind of O/S is the client? If it's Windows, look at Windows Defender/Firewall. If it's Linux, take
a look at the iptables.

Is this a recent occurrence, or has this client always had issues? Lots of background info needed, but
based off of what you've said you should focus your troubleshooting on the client.

Are you using Accelerator for this backup?

I have requested a bptestbpcd, bpgetconfig -e, vxlogview and bppllist -allpolicies from the failing
client and master. It turned out @Marianne was right, we've been looking at the wrong media server,
which we realized when we did not see the client listed in the media's policies. So, I mustered up the
courage to ask for the full gamut of logs (again) by hostname, and not by item, this time. Stay tuned
for that info...

The media and master are running AIX and the clients are running RHEL7.7. I did not look into
iptables because smaller fulls and incs are completing successfully. Something must have changed
because this is the first time this client is having any issues (I've checked into the last year). They use
Accelerator in some of their policies, but I did not see it in this client's log. I will find out soon though.

I believe everyone here is on the same page and know it's a network or FW issue. I have a strong
feeling it's the FW timeout, but I HAVE to give the proof before suggesting it. So, your keen eyes and
patience on this will be much appreciated!

This is a communication gap or issue between the master and media servers. I have faced a similar
situation, and it is really hard to pinpoint the problem.
Please verify whether you are using multiple NICs or IPs to communicate with the media servers. Try using
only one IP or NIC on the master server to communicate with all the media servers and check if it helps.

I just received a response. They have a dedicated NIC for backups with only a single IP configured.

He sent only the vxlogs for the days of failure, and confirmed that he has a dedicated NIC for backups
with no multiple IP configs.

He said "no binary file is created for bpgetconfig" and left it at that. I have the vxlogs you
requested; I looked at them but didn't really know what to look for or how to tell if something is off.
Please let me know if you see anything out of the ordinary.
