ELK Series Day 3: Adding Custom Fields and Identifying IP Geolocation with the Geoip Plugin


1. Introduction

In the previous post, "ELK Series Day 2: Configuring Logstash for Nginx Log Analysis", we worked through the basic ELK Stack setup needed for collecting web service logs, but what we built was only a first prototype. Real-world operations still call for a number of additions, and we will keep exploring this topic in this and the following posts.

2. Scenarios and Solutions

2.1 Adding Fields

In real operational work, a business system usually runs in several environments, typically at least a development environment (dev), a test environment (test), and a production environment (live). We could tell log sources apart with the agent.hostname field that logstash fills in automatically, but other needs quickly appear, such as distinguishing different log types on the same server, or handling certain entries within a single log file differently. For these cases we need to add extra fields to the log events.

2.1.1 Adding fields in filebeat

2.1.1.1 filebeat configuration example

We can add extra fields in the filebeat configuration file with the fields option; its values can be strings, arrays, or dictionaries. An example configuration:

filebeat.inputs:
- type: log
  enabled: true
  paths:
   - /var/log/nginx/hexo_access.log
  fields:
    env: test # mark as the test environment
    nginx_log_type: access
- type: log
  enabled: true
  paths:
   - /var/log/nginx/hexo_error.log
  fields:
    env: test
    nginx_log_type: error # mark as the error log

Reference: www.elastic.co/guide/en/be…
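By default filebeat nests these custom keys under a top-level fields object, which is why they later show up as fields.env and fields.nginx_log_type. If you would rather have them at the root of the event, filebeat also offers the fields_under_root option; a minimal sketch reusing the access-log input from above:

filebeat.inputs:
- type: log
  enabled: true
  paths:
   - /var/log/nginx/hexo_access.log
  fields:
    env: test
    nginx_log_type: access
  fields_under_root: true # env and nginx_log_type become top-level fields instead of fields.env / fields.nginx_log_type

Keep in mind that the conditionals shown in the next section would then have to test [env] and [nginx_log_type] instead of [fields][env] and [fields][nginx_log_type].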

2.1.1.2 Handling the custom fields in logstash

After the custom fields are added in the filebeat configuration, the events arriving in logstash carry them in the form fields.nginx_log_type: access. To filter and match explicitly on a custom field in logstash, configure it as follows:

...
filter {
  if [fields][nginx_log_type] == "access" {
    grok { ... }
  }
}
...

For the use of if in logstash pipeline configuration files, see: www.elastic.co/guide/en/lo…
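Building on the same conditional, one pipeline can treat access and error logs differently. Below is a minimal sketch, assuming the two nginx_log_type values set in the filebeat example above; the access branch uses the NGINXCOMBINEDLOG pattern file introduced later in section 2.3, and the error branch simply tags the event:

filter {
  if [fields][nginx_log_type] == "access" {
    grok {
      patterns_dir => [ "/etc/logstash/pattern.d" ]   # pattern file set up in section 2.3
      match => { "message" => "%{NGINXCOMBINEDLOG}" }
    }
  } else if [fields][nginx_log_type] == "error" {
    mutate {
      add_tag => [ "nginx_error" ]   # tag error-log events so they are easy to find in Kibana
    }
  }
}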

2.1.2 Adding fields in logstash

The logstash pipeline configuration supports adding or modifying fields through the mutate plugin. An example configuration:

input { stdin { } }
filter {
  grok {
    match => { "message" => "%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] \"%{WORD:request_method} %{DATA:uri} HTTP/%{NUMBER:http_version}\" %{NUMBER:response_code} %{NUMBER:body_sent_bytes} \"%{DATA:http_referrer}\" \"%{DATA:http_user_agent}\"" }
  }
  date {
    match => [ "time_local", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
  mutate {
    add_field => {
      "new_field" => "new_static_value"
      "foo_%{request_method}" => "something different from %{uri}"
    }
  }
}
output {
  stdout { codec => rubydebug }
}

As the example shows, when inserting a field we can also use field references to build both the field name and its value dynamically. The test output is as follows:

{
     "request_method" => "GET",
      "response_code" => "304",
        "remote_user" => "-",
               "host" => "logstash",
            "message" => "202.105.107.186 - - [18/Aug/2021:11:47:13 +0800] \"GET /images/alipay.png HTTP/1.1\" 304 0 \"http://rondochen.com/ELK2/\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36\"",
    "http_user_agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36",
         "@timestamp" => 2021-08-18T03:47:13.000Z,
          "new_field" => "new_static_value",
      "http_referrer" => "http://rondochen.com/ELK2/",
        "remote_addr" => "202.105.107.186",
         "time_local" => "18/Aug/2021:11:47:13 +0800",
    "body_sent_bytes" => "0",
       "http_version" => "1.1",
           "@version" => "1",
                "uri" => "/images/alipay.png",
            "foo_GET" => "something different from /images/alipay.png"
}

Reference: www.elastic.co/guide/en/lo…
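Besides add_field, the mutate plugin can also rename, convert, and remove fields. A small illustrative sketch built on the grok output above (the renamed field name is our own choice, not something the later configuration depends on):

filter {
  mutate {
    rename       => { "body_sent_bytes" => "bytes_sent" }   # give the field a shorter name
    convert      => { "response_code" => "integer" }        # index the status code as a number instead of a string
    remove_field => [ "message" ]                           # drop the raw log line once it has been parsed
  }
}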

2.2 Identifying the geographic location of IP addresses

In typical business analysis scenarios we often need statistics about where traffic comes from. Questions such as which articles on the site get the most visits, which time windows see the heaviest traffic, or simply how fast pages load can all be answered with the log searching we have already set up. But if we also want to know the geographic origin of the traffic, for example which province or city contributes the most users, we need to resolve the IP addresses.

Logstash provides the geoip plugin, which uses GeoLite2 to look up the region an IP address belongs to and automatically adds the corresponding fields. An example configuration:

input {}

filter {
    grok { ... }
    geoip {
      source => "remote_addr"
      target => "geoip"
    }
}

output {}

Note that the log line must be parsed first, so that an IP address or hostname can be handed to the geoip plugin through its source option; only then can the address be resolved correctly. The target option specifies that the lookup results are grouped under a field named geoip.

The first time geoip is used, you may have to wait a few minutes for the GeoLite2 database to finish initializing before the plugin works properly.

After geolocation, the returned result looks like this:

{
...
              "geoip" => {
          "country_name" => "China",
        "continent_code" => "AS",
              "location" => {
            "lat" => 22.5333,
            "lon" => 114.1333
        },
         "country_code2" => "CN",
             "city_name" => "Shenzhen",
                    "ip" => "14.154.29.133",
           "region_name" => "Guangdong",
             "longitude" => 114.1333,
         "country_code3" => "CN",
              "timezone" => "Asia/Shanghai",
           "region_code" => "GD",
              "latitude" => 22.5333
    }
}

Reference: www.elastic.co/guide/en/lo…
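If waiting for the automatic GeoLite2 download is not an option, for example on hosts without internet access, the geoip filter can instead be pointed at a local database file through its database option. A sketch assuming a GeoLite2 City database has been downloaded manually to /etc/logstash/ (the path is an assumption):

filter {
  geoip {
    source   => "remote_addr"
    target   => "geoip"
    database => "/etc/logstash/GeoLite2-City.mmdb"   # assumed path to a manually downloaded GeoLite2 City database
  }
}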

2.3 Centralized pattern management

In the previous post, we parsed the log content by writing the pattern inline in the grok plugin of the pipeline configuration file:

input { stdin { } }
filter {
  ...
  grok {
    match => { "message" => "%{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] \"%{WORD:request_method} %{DATA:uri} HTTP/%{NUMBER:http_version}\" %{NUMBER:response_code} %{NUMBER:body_sent_bytes} \"%{DATA:http_referrer}\" \"%{DATA:http_user_agent}\"" }
  }
  ...
}
output {
  stdout { codec => rubydebug }
}

This style is hard to maintain: when several pipeline configuration files handle the same log format, every change to the format has to be repeated in each of them, and the inline patterns also make the configuration files look cluttered.

We can maintain the patterns in one central place as follows:

  1. Create the file /etc/logstash/pattern.d/mypattern

  2. Put the pattern we need into mypattern, with the following content:

    NGINXCOMBINEDLOG %{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] \"%{WORD:request_method} %{DATA:uri} HTTP/%{NUMBER:http_version}\" %{NUMBER:response_code} %{NUMBER:body_sent_bytes} \"%{DATA:http_referrer}\" \"%{DATA:http_user_agent}\"
    
  3. Change the pipeline configuration file to the following form:

    input {  }
    filter {
      grok {
        patterns_dir => [ "/etc/logstash/pattern.d" ]
        match => { "message" => "%{NGINXCOMBINEDLOG}" }
      }
    ...
    }
    output {  }
    

Now, when the pattern needs to change, we only have to edit a single file and the change takes effect in every pipeline that references it.

In addition, the logstash developers provide ready-made patterns for many common log formats, which we can download and reference directly:

github.com/logstash-pl…

patterns_dir reference: www.elastic.co/guide/en/lo…
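A pattern file may also contain several named patterns, and a pattern can be composed from other patterns, which helps keep long expressions readable. A sketch of how the file above could be split up (the intermediate NGINXREQUEST name is our own; the pipeline still only references NGINXCOMBINEDLOG):

NGINXREQUEST %{WORD:request_method} %{DATA:uri} HTTP/%{NUMBER:http_version}
NGINXCOMBINEDLOG %{IPORHOST:remote_addr} - %{DATA:remote_user} \[%{HTTPDATE:time_local}\] \"%{NGINXREQUEST}\" %{NUMBER:response_code} %{NUMBER:body_sent_bytes} \"%{DATA:http_referrer}\" \"%{DATA:http_user_agent}\"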

2.4 Putting the configuration files together

Combining the improvements above, the configuration files we end up with are as follows:

  1. filebeat configuration

    logging.level: info
    logging.to_files: true
    logging.files:
      path: /var/log/filebeat
      name: filebeat
      keepfiles: 7
    
    filebeat.inputs:
    - type: log
      enabled: true
      paths:
       - /var/log/nginx/hexo_access.log
      fields:
        env: test
        nginx_log_type: access
    - type: log
      enabled: true
      paths:
       - /var/log/nginx/hexo_error.log
      fields:
        env: test
        nginx_log_type: error
    
    setup.template.settings:
      index.number_of_shards: 1
    
    output.logstash:
      hosts: ["192.168.0.211:5400"]
    
  2. logstash pipeline configuration

    input {
            beats {
                    host => "0.0.0.0"
                    port => 5400
            }
    }
    
    filter {
      if [fields][nginx_log_type] == "access" {
        grok {
          patterns_dir => ["/etc/logstash/pattern.d"]
          match => { "message" => "%{NGINXCOMBINEDLOG}" }
        }
        date {
          match => [ "time_local", "dd/MMM/yyyy:HH:mm:ss Z" ]
        }
        geoip {
          source => "remote_addr"
        }
      }
    }
    
    output {
            elasticsearch { 
                    hosts => ["192.168.0.212:9200"] 
                    index => "rc_index_pattern-%{+YYYY.MM.dd}"
            }
    }
    

After modifying the configuration files, restart the services for the changes to take effect.
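Assuming both services were installed from the official packages and are managed by systemd, that means:

systemctl restart filebeat
systemctl restart logstash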

Tips:

To make logstash reload its configuration on the fly without a full restart:

kill -SIGHUP ${logstash_pid}
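Alternatively, logstash can watch its pipeline files and reload them automatically whenever they change. In logstash.yml this looks like the following (setting names as documented for Logstash 7.x):

config.reload.automatic: true
config.reload.interval: 3s

The same behaviour can be enabled on the command line with the --config.reload.automatic flag.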

3. Summary

In the practice above we made a few small improvements to the logstash and filebeat configurations, bringing them closer to a real production setup. With these changes in place, we can distinguish production traffic from test traffic in Kibana using the fields.env field, and from here we can start doing business analysis on the website.
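For example, in Kibana's Discover search bar a KQL query like the one below narrows the view down to test-environment access logs (field names as produced by the configuration above):

fields.env : "test" and fields.nginx_log_type : "access"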

For reference, here is the complete JSON of one indexed document:

{
  "_index": "rc_index_pattern-2021.08.18",
  "_type": "_doc",
  "_id": "oPxeV3sB-S8uwXpk0k-d",
  "_version": 1,
  "_score": null,
  "fields": {
    "agent.version.keyword": [
      "7.13.4"
    ],
    "http_referrer.keyword": [
      "http://rondochen.com/"
    ],
    "geoip.timezone": [
      "Asia/Shanghai"
    ],
    "remote_addr.keyword": [
      "202.105.107.186"
    ],
    "host.name.keyword": [
      "hexo"
    ],
    "geoip.region_name.keyword": [
      "Guangdong"
    ],
    "geoip.country_code2.keyword": [
      "CN"
    ],
    "geoip.country_name.keyword": [
      "China"
    ],
    "agent.hostname.keyword": [
      "hexo"
    ],
    "request_method.keyword": [
      "GET"
    ],
    "remote_user": [
      "-"
    ],
    "ecs.version.keyword": [
      "1.8.0"
    ],
    "geoip.region_code.keyword": [
      "GD"
    ],
    "geoip.city_name.keyword": [
      "Shenzhen"
    ],
    "agent.name": [
      "hexo"
    ],
    "host.name": [
      "hexo"
    ],
    "geoip.longitude": [
      114.1333
    ],
    "fields.env.keyword": [
      "test"
    ],
    "geoip.location.lat": [
      22.5333
    ],
    "agent.id.keyword": [
      "d2f43da1-5024-4000-9251-0bcc8fc10697"
    ],
    "http_version": [
      "1.1"
    ],
    "time_local": [
      "18/Aug/2021:11:47:13 +0800"
    ],
    "@version.keyword": [
      "1"
    ],
    "geoip.region_name": [
      "Guangdong"
    ],
    "input.type": [
      "log"
    ],
    "log.offset": [
      2593
    ],
    "agent.hostname": [
      "hexo"
    ],
    "tags": [
      "beats_input_codec_plain_applied"
    ],
    "agent.id": [
      "d2f43da1-5024-4000-9251-0bcc8fc10697"
    ],
    "geoip.continent_code.keyword": [
      "AS"
    ],
    "ecs.version": [
      "1.8.0"
    ],
    "message.keyword": [
      "202.105.107.186 - - [18/Aug/2021:11:47:13 +0800] \"GET /ELK2/ HTTP/1.1\" 200 15101 \"http://rondochen.com/\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36\""
    ],
    "body_sent_bytes": [
      "15101"
    ],
    "geoip.latitude": [
      22.5333
    ],
    "http_referrer": [
      "http://rondochen.com/"
    ],
    "agent.version": [
      "7.13.4"
    ],
    "geoip.continent_code": [
      "AS"
    ],
    "response_code": [
      "200"
    ],
    "input.type.keyword": [
      "log"
    ],
    "geoip.region_code": [
      "GD"
    ],
    "tags.keyword": [
      "beats_input_codec_plain_applied"
    ],
    "remote_user.keyword": [
      "-"
    ],
    "geoip.country_code3.keyword": [
      "CN"
    ],
    "request_method": [
      "GET"
    ],
    "fields.nginx_log_type": [
      "access"
    ],
    "geoip.ip": [
      "202.105.107.186"
    ],
    "fields.nginx_log_type.keyword": [
      "access"
    ],
    "http_user_agent": [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
    ],
    "uri.keyword": [
      "/ELK2/"
    ],
    "agent.type": [
      "filebeat"
    ],
    "geoip.country_code3": [
      "CN"
    ],
    "geoip.country_code2": [
      "CN"
    ],
    "http_user_agent.keyword": [
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
    ],
    "@version": [
      "1"
    ],
    "geoip.country_name": [
      "China"
    ],
    "log.file.path.keyword": [
      "/var/log/nginx/hexo_access.log"
    ],
    "http_version.keyword": [
      "1.1"
    ],
    "agent.type.keyword": [
      "filebeat"
    ],
    "agent.ephemeral_id.keyword": [
      "e3425b80-edff-41c6-a6db-41d3e4904130"
    ],
    "remote_addr": [
      "202.105.107.186"
    ],
    "fields.env": [
      "test"
    ],
    "agent.name.keyword": [
      "hexo"
    ],
    "time_local.keyword": [
      "18/Aug/2021:11:47:13 +0800"
    ],
    "geoip.city_name": [
      "Shenzhen"
    ],
    "message": [
      "202.105.107.186 - - [18/Aug/2021:11:47:13 +0800] \"GET /ELK2/ HTTP/1.1\" 200 15101 \"http://rondochen.com/\" \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36\""
    ],
    "geoip.ip.keyword": [
      "202.105.107.186"
    ],
    "uri": [
      "/ELK2/"
    ],
    "@timestamp": [
      "2021-08-18T03:47:13.000Z"
    ],
    "body_sent_bytes.keyword": [
      "15101"
    ],
    "geoip.location.lon": [
      114.1333
    ],
    "response_code.keyword": [
      "200"
    ],
    "log.file.path": [
      "/var/log/nginx/hexo_access.log"
    ],
    "agent.ephemeral_id": [
      "e3425b80-edff-41c6-a6db-41d3e4904130"
    ],
    "geoip.timezone.keyword": [
      "Asia/Shanghai"
    ]
  },
  "highlight": {
    "agent.hostname.keyword": [
      "@kibana-highlighted-field@hexo@/kibana-highlighted-field@"
    ]
  },
  "sort": [
    1629258433000
  ]
  }
