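I. Introduction
This article uses a grep | wc shell pipeline to analyze an nginx log file, excluding requests from web crawlers (spiders/bots) that are not real user visits, so that the resulting count reflects actual user traffic.
II. Example script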
grep '07/Apr/2023' access.log \
  | grep -v 'YisouSpider' \
  | grep -v 'DuckDuckGo' \
  | grep -v 'Baiduspider' \
  | grep -v 'Bytespider' \
  | grep -v 'SemrushBot' \
  | grep -v 'DataForSeoBot' \
  | grep -v 'AhrefsBot' \
  | grep -v 'Googlebot' \
  | grep -v 'Bingbot' \
  | grep -v 'bingbot' \
  | grep -v 'YandexBot' \
  | grep -v 'dotbot' \
  | grep -v 'Sogou' \
  | grep -v 'mj12bot' \
  | grep -v 'bot' \
  | grep -v 'spider' \
  | wc -l
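Script explanation
grep '07/Apr/2023' access.log - selects the entries in nginx's access.log dated April 7, 2023
grep -v 'YisouSpider' ... grep -v 'spider' - exclude crawler requests; the -v option inverts the match, so each stage drops the lines that contain its pattern
wc -l - counts the remaining lines, i.e. the requests attributed to real users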
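The chain of grep -v stages can also be collapsed into a single case-insensitive pattern. The following is a minimal sketch, not the article's original script: the function name count_human_hits and the shortened pattern list are assumptions made for illustration. Because matching is case-insensitive and applies to the whole log line, it is slightly broader than the case-sensitive chain (for example, it also drops any request whose URL contains "bot").

```shell
# Hypothetical helper: count non-crawler requests for one day.
#   $1 = date as it appears in the log, e.g. 07/Apr/2023
#   $2 = path to the nginx access log (defaults to access.log)
# Usage: count_human_hits '07/Apr/2023' access.log
count_human_hits() {
    day=$1
    log=${2:-access.log}
    # -F treats the date as a literal string.
    # -v inverts the match, -i ignores case, -E enables the | alternation.
    # 'bot' and 'spider' already cover most of the names listed in the
    # article; DuckDuckGo and Sogou are kept explicitly, as in the original.
    grep -F -- "$day" "$log" \
        | grep -viE 'bot|spider|DuckDuckGo|Sogou' \
        | wc -l
}
```

A single grep -viE stage reads the data once instead of passing it through many processes, and adding a new crawler only means extending the alternation.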
III. Crawler reference
1) YisouSpider - Shenma Search crawler
"http://wwwdf.xwood.net/_site_domain_/_root/5870/5930/5932/25810/t_c3572621.html" "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36"
2) DuckDuckGo - DuckDuckGo's web crawler
20.191.45.212 - - [07/Apr/2023:08:57:25 +0800] "GET / HTTP/1.1" 200 306285 " "Mozilla/5.0 (compatible; DuckDuckGo-Favicons-Bot/1.0; +http://duckduckgo.com)"
3) Baiduspider - Baidu crawler
116.179.32.38 - - [07/Apr/2023:08:56:52 +0800] "GET /_site_domain_/_root/5870/5930/5932/9330/9383/t_c155285.html HTTP/1.1" 304 0 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
4) Bytespider - ByteDance crawler
110.249.202.205 - - [07/Apr/2023:08:56:47 +0800] "GET /_site_domain_/_root/5870/5930/5932/9330/9372/t_c81627.html HTTP/1.1" 200 10114 "-" "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)"
5) SemrushBot - web crawler of Semrush, an SEO and marketing company
185.191.171.33 - - [07/Apr/2023:08:56:02 +0800] "GET /_site_domain_/_root/5870/5930/5932/9330/9352/t_c18848.html HTTP/1.1" 200 103106 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
6) DataForSeoBot - DataForSEO's crawler (the sample request comes from an IP located in Falkenstein, Saxony, Germany)
136.243.228.179 - - [07/Apr/2023:08:56:02 +0800] "GET /_site_domain_/_root/5870/5930/5932/9330/14718/15164/t_c229841.html HTTP/1.1" 200 14247 "-" "Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)"
7) AhrefsBot - web crawler of Ahrefs, a well-known SEO company
51.222.253.15 - - [07/Apr/2023:08:51:41 +0800] "GET /_site_domain_/_root/5870/5930/5932/9330/9381/t_c133727.html HTTP/1.1" 200 77227 "-" "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
8) Googlebot - Google's crawler
66.249.68.18 - - [07/Apr/2023:08:47:40 +0800] "GET /_site_domain_/_root/5870/5930/5932/9330/9376/t_c103441.html HTTP/1.1" 200 10204 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.5563.146 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
9) bingbot - Microsoft Bing crawler
52.167.144.88 - - [07/Apr/2023:08:47:19 +0800] "GET /_site_domain_/_root/5870/5930/5932/9330/14708/14927/t_c226764.html HTTP/1.1" 200 14492 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/103.0.5060.134 Safari/537.36"
10) YandexBot - crawler of Yandex, Russia's largest search engine and internet company
213.180.203.79 - - [07/Apr/2023:08:37:36 +0800] "GET /_site_domain_/_root/5870/5874/t_c283100.html HTTP/1.1" 200 15670 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
11) DotBot - Moz.com's web crawler
216.244.66.234 - - [07/Apr/2023:07:57:25 +0800] "GET /_site_domain_/_root/5870/5874/t_c261846.html HTTP/1.1" 200 19415 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com)"
12) Sogou - Sogou search crawler
123.125.109.47 - - [07/Apr/2023:06:13:58 +0800] "GET /_site_domain_/_root/5870/5930/5932/25810/t_c1379651.html HTTP/1.1" 200 13761 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
13) MJ12bot - web crawler of Majestic, a well-known UK SEO company
62.138.2.243 - - [07/Apr/2023:05:18:46 +0800] "GET /_site_domain_/_root/5870/5874/t_c279787.html HTTP/1.1" 200 16822 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
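To see how much traffic each of the crawlers listed above generates on a given day, the same grep approach can be run once per name. This is only a sketch under stated assumptions: bot_breakdown is a hypothetical helper name, the name list simply mirrors the 13 entries above, and matching is a case-insensitive substring test on the whole log line, so a line could in principle be counted under more than one name.

```shell
# Hypothetical helper: print "<count> <name>" for each crawler listed
# in this article, sorted by request count in descending order.
#   $1 = date as it appears in the log, e.g. 07/Apr/2023
#   $2 = path to the nginx access log (defaults to access.log)
# Usage: bot_breakdown '07/Apr/2023' access.log
bot_breakdown() {
    day=$1
    log=${2:-access.log}
    for name in YisouSpider DuckDuckGo Baiduspider Bytespider SemrushBot \
                DataForSeoBot AhrefsBot Googlebot bingbot YandexBot DotBot \
                Sogou MJ12bot; do
        # grep -c prints the number of matching lines (0 if none);
        # "|| true" only hides grep's non-zero exit status for zero matches.
        count=$(grep -F -- "$day" "$log" | grep -ciF -- "$name" || true)
        printf '%s %s\n' "$count" "$name"
    done | sort -rn
}
```

Crawlers that show a count of 0 for many days are candidates for removal from the exclusion list, while unexpectedly high counts may be worth rate limiting at the nginx level.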