Summary
By default, the value of 'MaxStartups' is 10:30:100. This means that once 10 ssh connections are open and still waiting for authentication, each subsequent connection attempt has a 30% chance of being refused. Consequently, if you transfer files via sftp with a script and do NOT limit the number of concurrent requests, other ssh or rsh connections to the same host may fail unexpectedly.
Troubleshooting
Problem Summary
- A GPFS node shows the 'unknown' state when the mmgetstate -a command is executed
- Many 'refused' ssh connections are recorded in the file /var/log/secure
Common Causes
- SSH port is wrong or blocked
- Credentials are incorrect
- SSH client is not installed correctly
- SSH daemon is not installed or not started correctly
SSH Connection Refused: What It Is, Causes, and 6 Effective Methods to Fix It
Possible Reasons
- Poor network quality
  - High latency
  - Packet loss
  - Packet corruption
  - Packet reordering
- SSH daemon overload
Scripts & commands for testing
- Network traffic control tool
  Name: tc (with the netem queueing discipline)
  Usage:
  - Add a 1000 ms delay to traffic through device eth0
        tc qdisc add dev eth0 root netem delay 1000ms
  - Drop 10% of the packets through device eth0
        tc qdisc add dev eth0 root netem loss 10%
  - Corrupt 0.2% of the packets at random
        tc qdisc add dev eth0 root netem corrupt 0.2%
  - Send 25% of the packets immediately (with a correlation of 50%) and delay the rest by 10 ms
        tc qdisc change dev eth0 root netem delay 10ms reorder 25% 50%
- Script for fetching the GPFS nodes' state: mmstat.sh

      #!/bin/bash
      #
      log_file="/tmp/mmstate.log"    # path of log file
      node_to_check="test01"         # node name in GPFS cluster to test
      established_ssh=
      time_wait_ssh=
      mmresult=
      rsh_realtime=

      # 1. Function for counting SSH connections and reading sshd limits
      function count() {
          count_ssh=`netstat -pt | grep sshd`
          established_ssh=$(echo "$count_ssh" | grep --count 'ESTABLISHED')
          time_wait_ssh=$(echo "$count_ssh" | grep --count 'TIME_WAIT')
          login_grace_time=`sshd -T | grep -i "LoginGraceTime" | grep -o "[0-9]*"`
          max_startups=`sshd -T | grep -i "MaxStartups" | grep -o "[0-9]*:[0-9]*:[0-9]*"`
          max_sessions=`sshd -T | grep -i "MaxSessions" | grep -o "[0-9]*"`
          echo "SSH Established: $established_ssh; SSH TIME_WAIT: $time_wait_ssh" | tee -a "$log_file"
          # The sample outputs in this article show "LoginGraceTime: 0" because
          # the script originally printed $time_wait_ssh in this position.
          echo "SSH MaxStartups: $max_startups; LoginGraceTime: $login_grace_time; MaxSessions: $max_sessions" | tee -a "$log_file"
      }

      # 2. Function for fetching GPFS nodes' state
      function mmstate() {
          mmresult=`mmgetstate -a | grep -E "^[ ]{1,}[0-9]{1,}[ ]{1,}.*$" | sed 's/[ ]\{1,\}/ /g'`
          echo "$mmresult" | tee -a "$log_file"
      }

      # 3. Function for timing an ssh login to the node under test
      function rshstate() {
          rsh_realtime=$({ time ssh root@"$node_to_check" date &>/dev/null; } 2>&1 | grep -i real | sed 's/real\t//g')
          echo "RSH Real Time: $rsh_realtime" | tee -a "$log_file"
      }

      # Run the three checks in parallel, forever
      while true; do
          echo '------------------' | tee -a "$log_file"
          date "+%Y/%m/%d %H:%M:%S" | tee -a "$log_file"
          count &
          mmstate &
          rshstate &
          wait
          echo
      done
      exit 0

  Usage:

      sh mmstat.sh

  Output:

      ------------------
      2023/09/02 14:50:41
      RSH Real Time: 0m0.239s
      SSH Established: 2; SSH TIME_WAIT: 0
      SSH MaxStartups: 2:90:4; LoginGraceTime: 0; MaxSessions: 10
      1 test01 active
      2 test02 active

- Sending SFTP requests from the local host

  main_sftp.sh

      #!/bin/bash
      #
      # Start one sftp session in the background and return immediately
      /usr/bin/expect /path/to/sftp.exp &>/dev/null &

  sftp.exp

      #!/usr/bin/expect
      set timeout 5
      spawn sftp -oHostKeyAlgorithms=+ssh-dss root@21.21.78.101
      expect "assword: "
      send "xx\n"
      expect "sftp> "
      send "put /Users/beyan/Desktop/config.csv\n"
      expect "sftp> "
      # the trailing newline is required, otherwise "exit" is never submitted
      send "exit\n"
      expect eof
      exit 0

  Usage:

      for i in {1..100}; do sh /path/to/main_sftp.sh; done

  Note: change 100 to the number of concurrent processes you want.

- Sending ssh requests that never log in

  main_ssh.sh

      #!/bin/bash
      /usr/bin/expect /path/to/hold_login.exp &>/dev/null &
      exit 0

  hold_login.exp

      #!/usr/bin/expect
      set timeout 600
      spawn ssh -oHostKeyAlgorithms=+ssh-dss root@21.21.78.101
      expect "assword: "
      # stay at the password prompt until the timeout expires or the server
      # closes the connection; without this the script ends right away
      expect eof

  Usage:

      for i in {1..100}; do sh /path/to/main_ssh.sh; done

  Note: change 100 to the number of concurrent processes you want.
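When running the tc commands above by hand it is easy to forget to remove the rule afterwards. The following is a minimal sketch, not part of the original test, that applies one netem rule for a limited time and then deletes the root qdisc again. The names with_netem, run, DEV and DRY_RUN are hypothetical helpers introduced here; only the tc qdisc add/del invocations are real commands, and they need root privileges.

```bash
#!/bin/bash
# Sketch: apply a netem rule for a while, then remove it again.
DEV=${DEV:-eth0}          # assumption: the device used in the examples above
DRY_RUN=${DRY_RUN:-0}     # hypothetical switch: 1 = print commands instead of running them

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# with_netem SECONDS NETEM_ARGS...
# Adds "netem NETEM_ARGS" as root qdisc on $DEV, waits, then deletes it.
with_netem() {
    local seconds=$1
    shift
    run tc qdisc add dev "$DEV" root netem "$@"
    [ "$DRY_RUN" = "1" ] || sleep "$seconds"
    run tc qdisc del dev "$DEV" root
}
```

For example, `with_netem 60 loss 10%` would drop 10% of the packets on eth0 for one minute while mmstat.sh keeps logging in another terminal. Setting DRY_RUN=1 first shows the commands without touching the network.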
Test Environment
- GPFS Cluster
  - node1: 21.21.78.101
  - node2: 21.21.78.102
- GPFS Version: 4.2.2.0
- Host OS Version: RHEL 6.5
- SSH Version: OpenSSH 5.3p1
[root@test01 ~]# mmdiag --version
=== mmdiag: version ===
Current GPFS build: "4.2.2.0 ".
Built on Nov 11 2016 at 11:51:09
Running 1 hour 30 minutes 35 secs, pid 2845
[root@test01 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
[root@test01 ~]#
[root@test01 ~]# ssh -V
OpenSSH_5.3p1, OpenSSL 1.0.0-fips 29 Mar 2010
Verifying Possibility
- Poor network quality
  - High network latency
  - Network packet loss
  - Network packet corruption
  - Network packet reordering
Procedure
- Log the GPFS nodes' state continuously.
- Manipulate the network traffic manually with the tc commands provided above:
      netem delay 1000ms
      netem loss 10%
      netem corrupt 0.2%
      netem delay 10ms reorder 25% 50%
- Recheck the state of the nodes.
Result
Poor network quality does not impact the GPFS nodes' state.
- SSH server overload
Procedure
- Log the SSH & GPFS state with the script mmstat.sh
- Send many SFTP requests at once

      for i in {1..1000}; do sh /path/to/main_sftp.sh; done

- Check the script output

      ------------------
      2023/09/02 16:26:53
      SSH Established: 1; SSH TIME_WAIT: 0
      SSH MaxStartups: 10:30:100; LoginGraceTime: 0; MaxSessions: 10
      RSH Real Time: 0m0.233s
      test01: ssh_exchange_identification: Connection closed by remote host
      mmdsh: test01 remote shell process had return code 255.
      1 test01 unknown
      2 test02 active

- Check the sshd DEBUG messages in /var/log/secure (requires sshd DEBUG logging to be enabled)

      ...
      Sep 2 16:26:53 test01 sshd[31642]: debug3: oom_adjust_restore
      Sep 2 16:26:53 test01 sshd[31642]: Set /proc/self/oom_score_adj to 0
      ...

Result
1000 SFTP requests appear to cause an OS out-of-memory condition (this is not confirmed), which makes mmgetstate fail when it connects via ssh.
Related Links
- Logging
Temporary summary
Typically, neither network issues nor sshd load should cause an ssh request to fail (load alone cannot cause an OOM condition).
Supposition
The default sshd configuration refuses additional ssh login requests once some kind of queue is full.
SSH Daemon Configuration Tuning & Verifying
Options
MaxStartups
Specifies the maximum number of concurrent unauthenticated connections
to the SSH daemon. Additional connections will be dropped until
authentication succeeds or the LoginGraceTime expires for a connection.
The default is 10:30:100.
Alternatively, random early drop can be enabled by specifying the
three colon separated values “start:rate:full” (e.g. "10:30:60").
sshd(8) will refuse connection attempts with a probability of
“rate/100” (30%) if there are currently “start” (10) unauthenticated
connections. The probability increases linearly and all connection
attempts are refused if the number of unauthenticated connections
reaches “full” (60).
LoginGraceTime
The server disconnects after this time if the user has not successfully
logged in. If the value is 0, there is no time limit.
The default is 120 seconds.
MaxSessions
Specifies the maximum number of open sessions permitted per network
connection. The default is 10.
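The "start:rate:full" rule in the MaxStartups description can be turned into a small calculation. The sketch below prints the refusal probability in percent for a given number of unauthenticated connections, using a linear increase from "rate" at "start" to 100 at "full" as the man page describes. The function name drop_probability is hypothetical, and the integer rounding is an assumption rather than a statement about how sshd rounds internally.

```bash
#!/bin/bash
# drop_probability START RATE FULL N
# Prints the chance (in percent) that a new connection is refused when
# N unauthenticated connections are already open.
drop_probability() {
    local start=$1 rate=$2 full=$3 n=$4
    if (( n < start )); then
        echo 0                      # below "start": nothing is dropped
    elif (( n >= full )); then
        echo 100                    # at or above "full": everything is dropped
    else
        # linear increase from RATE (at START) to 100 (at FULL), integer math
        echo $(( rate + (100 - rate) * (n - start) / (full - start) ))
    fi
}

drop_probability 10 30 100 9     # prints 0
drop_probability 10 30 100 10    # prints 30
drop_probability 10 30 100 55    # prints 65
drop_probability 4 90 6 5        # prints 95
```

With the default 10:30:100, refusals therefore start as soon as 10 connections sit at the password prompt, long before the hard limit of 100 is reached.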
Procedure
- Increase the value of LoginGraceTime so that a connection can be held at the password prompt for as long as possible

  FILE: /etc/ssh/sshd_config

      LoginGraceTime 10m

- Decrease the start value to 4 (the GPFS nodes already take 2)
- Increase the rate value to 90 (drop new connections with 90% probability)
- Decrease the full value to 6 (all additional connections are refused once 6 unauthenticated connections exist)

      MaxStartups 4:90:6

- Restart the sshd service and verify the options

      service sshd restart
      sshd -T | grep -E "grace|startup"

- Log the GPFS nodes' state continuously

      sh mmstat.sh

      ------------------
      2023/09/02 17:23:19
      SSH Established: 2; SSH TIME_WAIT: 0
      SSH MaxStartups: 4:90:6; LoginGraceTime: 0; MaxSessions: 10
      RSH Real Time: 0m0.167s
      1 test01 active
      2 test02 active

  Note: 2 connections are established.

- Create another 2 connections

      ssh -oHostKeyAlgorithms=+ssh-dss root@21.21.78.101

- Check the OS & GPFS state

      ------------------
      2023/09/02 17:25:29
      SSH Established: 4; SSH TIME_WAIT: 0
      SSH MaxStartups: 4:90:6; LoginGraceTime: 0; MaxSessions: 10
      RSH Real Time: 0m0.166s
      1 test01 active
      2 test02 active

- Create one more connection
- Check the script output

      ------------------
      2023/09/02 17:29:22
      RSH Real Time: 0m0.174s
      SSH Established: 5; SSH TIME_WAIT: 0
      SSH MaxStartups: 4:90:6; LoginGraceTime: 0; MaxSessions: 10
      test01: ssh_exchange_identification: Connection closed by remote host
      mmdsh: test01 remote shell process had return code 255.
      1 test01 unknown
      2 test02 active

  Note: the refusal probability is 90%, so if the ssh request is not refused, try creating more connections.
One More Possibility
By default, the value of 'MaxStartups' is 10:30:100. This means that once 10 ssh connections are open and still waiting for authentication, the next connection attempt has a 30% chance of being refused. Consequently, if you transfer files via sftp with a script and do NOT limit the number of concurrent requests, other ssh or rsh connections to the same host may fail unexpectedly.
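One way to limit the concurrent requests is to cap the number of background jobs in the transfer script. The following is a minimal sketch under that assumption: the names MAX_JOBS, throttle and run_limited are hypothetical, and the polling loop is used instead of `wait -n` because older bash versions do not provide it.

```bash
#!/bin/bash
# Sketch: run a command many times, but never more than MAX_JOBS at once.
MAX_JOBS=${MAX_JOBS:-5}   # assumption: keep this below the MaxStartups "start" value

# Block until fewer than MAX_JOBS background jobs are still running.
throttle() {
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 0.2
    done
}

# run_limited COUNT COMMAND [ARGS...]
# Starts COMMAND COUNT times in the background, at most MAX_JOBS in parallel,
# and returns after all of them have finished.
run_limited() {
    local count=$1
    shift
    local i
    for (( i = 1; i <= count; i++ )); do
        throttle
        "$@" &
    done
    wait
}
```

For the test above this could be used as `run_limited 100 /usr/bin/expect /path/to/sftp.exp`. The expect script is called directly here rather than through main_sftp.sh, because main_sftp.sh puts the transfer into the background itself and would return before the session has finished.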