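I. Introduction
Tuning max_fails, fail_timeout, proxy_connect_timeout and related parameters in nginx.conf lets nginx fail over to a healthy backend when one server goes down. The reference configuration is as follows: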
proxy_next_upstream error timeout http_500 http_502 http_504;

upstream uncritical {
    server 192.168.1.16:8080;
    server 192.168.1.15:8088 max_fails=1 fail_timeout=30s;  # how long a server stays disabled once it is found to be failing
    keepalive 128;
}

location /gw/openapi/ {
    proxy_pass http://uncritical;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_redirect off;
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Cookie $http_cookie;
    add_header Access-Control-Allow-Origin $corsHost;
    proxy_connect_timeout 30;  # should normally be changed to 2 seconds
    #proxy_read_timeout 60;
}
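II. Test walkthrough
1. Test entry point (192.168.1.16 and 192.168.1.15 are the primary and backup application servers): curl "http://192.168.1.10/gw/openapi/test.action"
2. Check nginx's access.log. A normal entry looks like this: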
-[13/Jan/2017:09:52:36 +0800] 192.168.1.10 "GET /gw/openapi/test.action HTTP/1.1" uip:192.168.1.16:8080 status:200 res_bs:90 - req_time:0.090 "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.3.0 zlib/1.2.3 libidn/1.
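The entries in this walkthrough appear to come from a custom log_format that the post does not show. Below is a minimal sketch of a format that would produce similar fields, assuming uip, res_bs and req_time map to the standard variables $upstream_addr, $body_bytes_sent and $request_time. The format name gw_access, the log path, the field order and the extra up_time field are hypothetical.

```nginx
# Hypothetical format (not from the original post); belongs in the http block.
# Assumed mapping: uip -> $upstream_addr, res_bs -> $body_bytes_sent,
# req_time -> $request_time. up_time is an extra field added for illustration.
log_format gw_access '[$time_local] $remote_addr "$request" '
                     'uip:$upstream_addr status:$status res_bs:$body_bytes_sent '
                     'req_time:$request_time up_time:$upstream_response_time '
                     '"$http_user_agent"';

access_log /var/log/nginx/access.log gw_access;
```

When nginx tries more than one upstream for a single request, $upstream_addr lists every address it contacted, separated by commas. That is why an entry for a request that failed over shows two addresses in the uip field.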
3. A failure occurs and the request fails over. Both clients hit it, and the request time exceeds 30 seconds:
-[13/Jan/2017:09:53:03 +0800] 192.168.1.10 "GET /gw/openapi/test.action HTTP/1.1" uip:192.168.1.15:8088, 192.168.1.16:8080 status:200 res_bs:90 - req_time:30.003 "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.3.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"

-[13/Jan/2017:09:53:09 +0800] 192.168.1.10 "GET /gw/openapi/test.action HTTP/1.1" uip:192.168.1.15:8088, 192.168.1.16:8080 status:200 res_bs:90 - req_time:30.005 "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.3.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
4. nginx takes the failed server out of rotation, and subsequent requests are normal:
-[13/Jan/2017:09:53:10 +0800] 192.168.1.10 "GET /gw/openapi/test.action HTTP/1.1" uip:192.168.1.16:8080 status:200 res_bs:90 - req_time:0.003 "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.3.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"

-[13/Jan/2017:09:53:32 +0800] 192.168.1.10 "GET /gw/openapi/test.action HTTP/1.1" uip:192.168.1.16:8080 status:200 res_bs:90 - req_time:0.002 "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.3.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
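5. After 30 seconds the failed server is put back into rotation, which triggers failover again, and the request time once more exceeds 30 seconds: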
-[13/Jan/2017:09:54:11 +0800] 192.168.1.10 "GET /gw/openapi/test.action HTTP/1.1" uip:192.168.1.15:8088, 192.168.1.16:8080 status:200 res_bs:90 - req_time:30.005 "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.3.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
6. nginx takes the failed server out of rotation again, and subsequent requests are normal:
-[13/Jan/2017:09:54:11 +0800] 192.168.1.10 "GET /gw/openapi/test.action HTTP/1.1" uip:192.168.1.16:8080 status:200 res_bs:90 - req_time:0.002 "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.3.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2"
7. Error log (error.log):
# First failure: requests reach the failed backend 192.168.1.15:8088; each of the two clients sends one request.
2017/01/13 09:53:03 [error] 17216#17216: *7208955 upstream timed out (110: Connection timed out) while connecting to upstream, client: 192.168.1.10, server: 192.168.1.10, request: "GET /gw/openapi/test.action HTTP/1.1", upstream: "http://192.168.1.15:8088/gw/openapi/test.action", host: "192.168.1.10"

2017/01/13 09:53:09 [error] 17216#17216: *7208962 upstream timed out (110: Connection timed out) while connecting to upstream, client: 192.168.1.10, server: 192.168.1.10, request: "GET /gw/openapi/test.action HTTP/1.1", upstream: "http://192.168.1.15:8088/gw/openapi/test.action", host: "192.168.1.10"

# Second failure.
2017/01/13 09:54:11 [error] 17216#17216: *7209022 upstream timed out (110: Connection timed out) while connecting to upstream, client: 192.168.1.10, server: 192.168.1.10, request: "GET /gw/openapi/test.action HTTP/1.1", upstream: "http://192.168.1.15:8088/gw/openapi/test.action", host: "192.168.1.10"
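III. Conclusion
Failover works as expected. In production the connect timeout can be set to 2 seconds: nginx and the backends are on the same LAN, so a short value is fine.
Issues: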
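1. After a server fails, every request that round-robin sends to that backend is delayed by the connect timeout (2 seconds in the proposed setting) until nginx marks the server as failed.
2. If the disable period (fail_timeout) is too long, a server that has already recovered is slow to be put back into use, so service is not restored promptly after a brief network interruption. If it is too short, the delayed requests described in item 1 occur frequently.
Recommended settings: a connect timeout of 2 seconds and a disable period of 30 seconds.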
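The recommended values could be applied roughly as follows. This is a sketch rather than a tested production configuration: it reuses the upstream name and addresses from the test setup, and setting max_fails and fail_timeout on both servers (the test configuration set them on only one) is an assumption made here for symmetry.

```nginx
# Sketch of the recommended settings; both blocks sit inside the existing
# http and server contexts, and the other proxy_set_header lines are omitted.
upstream uncritical {
    # Assumption: the same passive health-check settings on both servers.
    # fail_timeout is both the window in which max_fails failures are counted
    # and the time the server is then considered unavailable.
    server 192.168.1.16:8080 max_fails=1 fail_timeout=30s;
    server 192.168.1.15:8088 max_fails=1 fail_timeout=30s;
    keepalive 128;
}

location /gw/openapi/ {
    proxy_pass http://uncritical;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_next_upstream error timeout http_500 http_502 http_504;
    proxy_connect_timeout 2s;  # was 30 in the test; short because the backends are on the LAN
}
```

With these values, a request that lands on a dead backend should wait about 2 seconds instead of 30 before nginx retries it on the other server.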