A Hands-On Guide to NoSQL: Storing and Accessing Data in MongoDB

Storing and Accessing Data in MongoDB

The Apache web server combined log format captures the following request and response attributes for a web server:

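IP address of the client - If the client requests the resource through a proxy, this value can be the proxy's IP address.
Identity of the client - This is usually not a reliable piece of information and is often not recorded.
User name as identified during authentication - This value is blank if no authentication is required to access the web resource.
Time when the request was received - Includes the date and time, along with the time zone.
The request itself - This can be further broken down into four parts: the method used, the resource, the request parameters, and the protocol.
Status code - The HTTP status code.
Size of the object returned - The size in bytes.
Referrer - Typically the URI or URL of the page that links to the web page or resource.
User-agent - The client application, usually the program or device that accesses the web page or resource.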

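The log file itself is a text file that stores each request on a separate line. To get the data out of the text file, you need to parse it and extract the values. A simple Python program to parse this log file can be put together quickly, as shown in the following example.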

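import re
import fileinput

# One capture group per field: ip, ident, user, [time], "request", status, size, "referrer", "user-agent".
_lineRegex = re.compile(r'(\d+\.\d+\.\d+\.\d+) ([^ ]*) ([^ ]*) \[([^\]]*)\] "([^"]*)" (\d+) ([^ ]*) "([^"]*)" "([^"]*)"')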

class ApacheLogRecord(object):

    def __init__(self, *rgroups):
        self.ip, self.ident, self.http_user, self.time, self.request_line, self.http_response_code, self.http_response_size, self.referrer, self.user_agent = rgroups
        # Raises ValueError when the request line does not have exactly three parts.
        self.http_method, self.url, self.http_vers = self.request_line.split()

    def __str__(self):
        return ' '.join([self.ip, self.ident, self.time, self.request_line, self.http_response_code, self.http_response_size, self.referrer, self.user_agent])

class ApacheLogFile(object):

    def __init__(self, *filename):
        # fileinput.input accepts a sequence of file names and reads them in order.
        self.f = fileinput.input(filename)

    def close(self):
        self.f.close()

    def __iter__(self):
        match = _lineRegex.match
        for line in self.f:
            m = match(line)
            if m:
                try:
                    log_line = ApacheLogRecord(*m.groups())
                except Exception as e:
                    print("NON_COMPLIANT_FORMAT: ", line, "Exception:", e)
                    continue
                # Yield outside the try block so closing the generator is not swallowed.
                yield log_line
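Before pointing the parser at a whole file, it can help to check the pattern against a single line. The following is a minimal sketch, not part of the original listing: the names LINE_RE, FIELDS, SAMPLE, and parse_line are made up for illustration, and the sample line is assembled from the field values used in the JSON example later in this article.

```python
import re

# Combined log format: ip ident user [time] "request" status size "referrer" "user-agent"
LINE_RE = re.compile(
    r'(\d+\.\d+\.\d+\.\d+) ([^ ]*) ([^ ]*) \[([^\]]*)\] '
    r'"([^"]*)" (\d+) ([^ ]*) "([^"]*)" "([^"]*)"'
)

# Hypothetical field names, chosen to mirror the attributes of ApacheLogRecord.
FIELDS = ("ip", "ident", "http_user", "time", "request_line",
          "http_response_code", "http_response_size", "referrer", "user_agent")

SAMPLE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" '
          '"Mozilla/4.08 [en] (Win98; I ;Nav)"')


def parse_line(line):
    """Return a dict of the nine captured fields, or None if the line does not match."""
    m = LINE_RE.match(line)
    return dict(zip(FIELDS, m.groups())) if m else None


record = parse_line(SAMPLE)
print(record["ip"], record["request_line"], record["http_response_code"])
```

Every captured value is a string at this point, including the status code and the response size.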

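The sample log parser is written in Python, so using PyMongo (the Python MongoDB driver) is the easiest way to write the data to MongoDB. However, before I get into the specifics of using PyMongo, I recommend a slight detour into the essentials of data storage in MongoDB.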

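MongoDB is a document store that can persist any arbitrary collection of data as long as it can be represented as a JSON-like object hierarchy. JSON is a fast, lightweight, and popular data interchange format for web applications. To give a flavor of the JSON format, the elements extracted from the access log can be represented as follows: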

{
    "ApacheLogRecord": {
        "ip": "127.0.0.1",
        "ident": "-",
        "http_user": "frank",
        "time": "10/Oct/2000:13:55:36 -0700",
        "request_line": {
            "http_method": "GET",
            "url": "/apache_pb.gif",
            "http_vers": "HTTP/1.0"
        },
        "http_response_code": "200",
        "http_response_size": "2326",
        "referrer": "http://www.example.com/start.html",
        "user_agent": "Mozilla/4.08 [en] (Win98; I ;Nav)"
    }
}

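The corresponding line in the log file is as follows:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"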

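MongoDB supports all JSON data types, namely string, integer, boolean, double, null, array, and object. It also supports a few additional data types: date, object id, binary data, regular expression, and code. Mongo supports these additional types because it uses BSON, a binary-encoded serialization of JSON-like structures, rather than plain JSON.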
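The parser captures every field as a string, so the richer BSON types only come into play if values are converted before they are saved. The following is a rough sketch of such a conversion, assuming the timestamp layout of the sample log line and an English month abbreviation; typed_fields is a hypothetical helper, not something the original listing defines. PyMongo stores a Python datetime as a BSON date, an int as a BSON integer, and None as null.

```python
from datetime import datetime


def typed_fields(time_str, status_str, size_str):
    """Convert raw log strings into richer Python types.

    Assumes a timestamp such as '10/Oct/2000:13:55:36 -0700'.
    """
    # %b matches the abbreviated month name, %z the numeric UTC offset.
    received = datetime.strptime(time_str, "%d/%b/%Y:%H:%M:%S %z")
    status = int(status_str)
    # Apache writes '-' when no response body was sent; keep that as None.
    size = None if size_str == "-" else int(size_str)
    return {
        "time": received,
        "http_response_code": status,
        "http_response_size": size,
    }


doc = typed_fields("10/Oct/2000:13:55:36 -0700", "200", "2326")
print(doc)
```

Storing real dates and numbers instead of strings makes range queries possible later, for example all requests within a given hour or all responses larger than some size.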

To insert the JSON-like document for that log line into a collection named logdata, you can run the following in the Mongo shell:

doc = {
    "ApacheLogRecord": {
        "ip": "127.0.0.1",
        "ident": "-",
        "http_user": "frank",
        "time": "10/Oct/2000:13:55:36 -0700",
        "request_line": {
            "http_method": "GET",
            "url": "/apache_pb.gif",
            "http_vers": "HTTP/1.0"
        },
        "http_response_code": "200",
        "http_response_size": "2326",
        "referrer": "http://www.example.com/start.html",
        "user_agent": "Mozilla/4.08 [en] (Win98; I ;Nav)"
    }
};

// insertOne() is the current shell method; older shells used insert().
db.logdata.insertOne(doc);

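In the Python example, you can save the data in a dictionary (also called a map, hash map, or associative array in other languages) directly to MongoDB. This is because PyMongo, the driver, takes care of converting the dictionary to the BSON data format. To complete the example, create a utility function that publishes all of an object's attributes and their corresponding values as a dictionary, as follows: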

import inspect

def props(obj):
    pr = {}
    for name in dir(obj):
        value = getattr(obj, name)
        # Skip dunder attributes and bound methods; keep plain data attributes.
        if not name.startswith('__') and not inspect.ismethod(value):
            pr[name] = value
    return pr

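This function saves request_line as a single string element. Because the ApacheLogRecord constructor also splits the request line, http_method, url, and http_vers end up in the dictionary as separate top-level fields as well.

You may prefer to keep only the three separate fields: HTTP method, URL, and protocol version.

You may also prefer to create a nested object hierarchy, like the JSON document shown earlier. With this function in place, storing the data in MongoDB takes only a few lines of code: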

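from pymongo import MongoClient

# MongoClient() connects to localhost:27017 by default.
connection = MongoClient()
db = connection.mydb
collection = db.logdata
# Replace the placeholder with the path to your access_log file.
alf = ApacheLogFile('<path to access_log>')
for log_line in alf:
    collection.insert_one(props(log_line))
alf.close()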
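If the nested layout from the JSON example is preferred over the flat dictionary, the request fields can be regrouped before the insert. This is one possible sketch, assuming the flat dictionary carries the http_method, url, and http_vers keys set by the parser; nest_request_line is a hypothetical helper introduced here for illustration.

```python
def nest_request_line(flat):
    """Return a nested copy of a flat record dict.

    Assumes `flat` has http_method, url and http_vers keys next to
    request_line, which is how the parser's attributes are named.
    """
    doc = dict(flat)  # shallow copy so the caller's dict is left untouched
    # Replace the raw request string with a sub-document of its three parts.
    doc["request_line"] = {
        "http_method": doc.pop("http_method"),
        "url": doc.pop("url"),
        "http_vers": doc.pop("http_vers"),
    }
    return {"ApacheLogRecord": doc}


flat = {
    "ip": "127.0.0.1",
    "request_line": "GET /apache_pb.gif HTTP/1.0",
    "http_method": "GET",
    "url": "/apache_pb.gif",
    "http_vers": "HTTP/1.0",
    "http_response_code": "200",
}
nested = nest_request_line(flat)
print(nested["ApacheLogRecord"]["request_line"]["url"])
```

The result can be passed to the collection in place of the flat dictionary. Queries against nested fields then use dotted paths such as ApacheLogRecord.request_line.url.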

Isn't that simple? Once the log data is stored, it can be filtered and analyzed.
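As a rough idea of what filtering might look like, the two helpers below wrap PyMongo's count_documents and find calls. They are a sketch under assumptions: the function names are invented for illustration, the collection argument is expected to be a PyMongo collection such as the logdata collection, and the records are assumed to be the flat dictionaries with string values that the parser produces.

```python
def count_by_status(collection, status_code):
    """Count stored records that have the given HTTP status code.

    The parser stores status codes as strings, so the filter compares strings.
    """
    return collection.count_documents({"http_response_code": str(status_code)})


def urls_for_ip(collection, ip):
    """Return the URLs requested by a single client IP address."""
    # The second argument is a projection: return only url and omit _id.
    cursor = collection.find({"ip": ip}, {"url": 1, "_id": 0})
    return [doc["url"] for doc in cursor]
```

With a running server, count_by_status(collection, 404) would report how many requests ended in a 404, and urls_for_ip(collection, "127.0.0.1") would list what that client asked for.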
