High Number of Connection Resets During Transfers - Exchange 2013


#1

Hello Everyone!

I had been testing an HAPROXY configuration with 2 Exchange 2013 servers. Between the documentation and some help from this forum, I was able to get a functional load balancer working for all Exchange services. I ran a pilot test with roughly 30 users (various versions of Outlook, and ActiveSync on both Android and Apple devices). Everything went well and I really didn’t receive any complaints or issues.

We’ve gone live with the configuration (full user base is roughly 300 users). While there have been no specific issues, I have noticed that Outlook clients intermittently take a while to connect and to pull up things like shared calendars.

Everything on the Exchange side checks out. The only thing I’ve noticed is that (as the title says) there are a high number of connection resets during transfers. I feel like the volume of resets isn’t normal, but I’m not sure what else I can adjust.

I have attached my configuration below. Any assistance would be greatly appreciated!

global

log 127.0.0.1 local0 info

maxconn 10000

daemon

quiet

tune.ssl.default-dh-param 2048


ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECD$


ssl-default-server-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:E$

defaults

log global

mode http

option httplog

option dontlognull

timeout connect 60000ms

timeout client 30000ms

timeout server 60000ms

timeout check 60000ms

stats enable

stats hide-version

stats show-node

stats auth admin:password
stats uri /stats

frontend unsecured 1.2.3.4:80

redirect location https://mail.domain.com/owa

frontend fe_ex2013

mode http

bind *:443 ssl crt /etc/ssl/certs/exchange_certificate
acl autodiscover url_beg /Autodiscover

acl mapi url_beg /mapi

acl rpc url_beg /rpc

acl owa url_beg /owa

acl eas url_beg /microsoft-server-activesync

acl ecp url_beg /ecp

acl ews url_beg /ews

acl oab url_beg /oab

use_backend be_ex2013_autodiscover if autodiscover

use_backend be_ex2013_mapi if mapi

use_backend be_ex2013_rpc if rpc

use_backend be_ex2013_owa if owa

use_backend be_ex2013_eas if eas

use_backend be_ex2013_ecp if ecp

use_backend be_ex2013_ews if ews

use_backend be_ex2013_oab if oab

default_backend be_ex2013

backend be_ex2013_autodiscover

mode http

balance leastconn 

option httpchk GET /autodiscover/healthcheck.htm

option log-health-checks

http-check expect status 200

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

backend be_ex2013_mapi

mode http

balance leastconn 

option httpchk GET /mapi/healthcheck.htm

option log-health-checks

http-check expect status 200

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

backend be_ex2013_rpc

mode http

balance leastconn 

option httpchk GET /rpc/healthcheck.htm

option log-health-checks

http-check expect status 200

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

backend be_ex2013_owa

mode http

balance leastconn 

option httpchk GET /owa/healthcheck.htm

option log-health-checks

http-check expect status 200

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

backend be_ex2013_eas

mode http

balance leastconn 

option httpchk GET /microsoft-server-activesync/healthcheck.htm

option log-health-checks

http-check expect status 200

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

backend be_ex2013_ecp

mode http

balance leastconn 

option httpchk GET /ecp/healthcheck.htm

option log-health-checks

http-check expect status 200

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

backend be_ex2013_ews

mode http

balance leastconn 

option httpchk GET /ews/healthcheck.htm

option log-health-checks

http-check expect status 200

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

backend be_ex2013_oab

mode http

balance leastconn 

option httpchk GET /oab/healthcheck.htm

option log-health-checks

http-check expect status 200

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

backend be_ex2013

mode http

balance leastconn 

server Cas1 10.10.10.31:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

server Cas2 10.10.10.28:443 check ssl inter 15s verify required ca-file /etc/ssl/certs/ca-bundle.crt

listen smtp *:25

mode tcp

option tcplog

balance leastconn

server Cas1 10.10.10.31:25 check

server Cas2 10.10.10.28:25 check


#2

The significant information will only be in your haproxy log files; you cannot really draw any conclusions from the stats page.


#3

There seems to be a large number of entries with a 401 status code… Would it be within the forum rules to post a portion of the log file here?


#4

Just use pastebin or some similar service.


#5

Here is the link: https://pastebin.com/uEBrn3Zr

For a little background, roughly 90% of users are at a single site behind the same IP address (External IPs and hostnames have been removed from the logs).

I’ve been trying to track down the issue, but I can’t tell if it’s an issue with the Exchange virtual directory authentication (they are all pretty much left at the defaults), with the fact that my config doesn’t perform SSL offloading (SSL offloading is currently turned off on the CAS servers; I have tried it on as well), or with the HAPROXY configuration (L7 vs L4? Issue with session persistence?).

Once again, any help on this would be awesome.

Thanks.


#6

What you see is caused by your low timeout values.
The Exchange servers and ActiveSync clients/Outlook use long-lived connections for their push mechanism. Basically, they leave a TCP session open for up to 900 seconds without transmitting data. The idea is that the server sends data back once a change has happened that needs to be “pushed”.

Now your defaults tell haproxy to drop a connection if it is idle for more than 30 seconds (client side) or 60 seconds (server side):

timeout client 30000ms

timeout server 60000ms

You can also see that in your log which shows that the connection was closed by haproxy after 60 seconds (60127ms) due to the timeout being reached (code sD):

Apr 12 09:40:58 localhost haproxy[18690]: EXTERNAL-IP:57064 [12/Apr/2017:09:39:58.828] fe_ex2013~ be_ex2013_rpc/ExchCas2 3/0/0/9/60127 200 1601 - - sD-- 48/48/33/16/0 0/0 "RPC_OUT_DATA /rpc/rpcproxy.dll?7a41fb90-289c-42be-9704-a928a2b962f8@domain.com:6001 HTTP/1.1"

Details for troubleshooting the session state codes can be found in the documentation: https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#8.5

Change both of these to 1000s (900s would be enough, but better safe than sorry) and the issue should disappear. Note that this does lead to a high number of open sessions, so you might need to increase the connection limits (maxconn) as well (in an environment that also has about 300 users, we see up to 4000 open sessions from Exchange alone).
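As a rough sketch of that change, the defaults section from the first post could look something like the fragment below. Only the two timeout values come from the advice above; the maxconn line and its value of 8000 are an assumption added for illustration, based on the 1.7-era behaviour where a frontend without its own maxconn is limited to 2000 connections. Treat it as a starting point to check against the documentation for your version, not a tested configuration.

```haproxy
defaults
    log global
    mode http
    option httplog
    option dontlognull

    # Unchanged from the original post.
    timeout connect 60000ms
    timeout check   60000ms

    # Raised from 30000ms/60000ms so that idle Outlook/ActiveSync push
    # connections (open for up to 900 seconds) are not cut by haproxy.
    timeout client  1000s
    timeout server  1000s

    # Assumption, not from the thread: explicit per-frontend limit sized
    # above the ~4000 open sessions mentioned for a 300-user environment.
    # It must stay below the global maxconn (10000 in the posted config).
    maxconn 8000
```

After reloading, sessions that previously ended with the sD termination state at around 60 seconds should no longer show up in the log if the timeouts were indeed the cause.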

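To add to that: the 401 errors that were mentioned are expected due to how Exchange authentication works (the first, unauthenticated request gets a 401 challenge before the client retries with credentials).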


#7

If you are running 1.7.3, note that this release also contains a regression. In that case you should upgrade to the latest stable release.


#8

You guys are great! I changed the connection timeouts and the issue is gone. I’ll look into upgrading the version as well.

Thanks to you both!