Elasticsearch: Ingesting data with BigQuery


Author: Jeffrey Rengifo, from Elastic

Learn how to index and search Google BigQuery data in Elasticsearch using Python.

BigQuery is a Google platform that allows you to centralize data from different sources and services into a single repository. It also supports data analysis and the use of generative AI (GenAI) and machine learning (ML) tools. There are several ways to bring data into BigQuery.

Indexing data from all of these sources into Elasticsearch helps you centralize your data sources for a better observability experience.

In this article, you will learn how to index BigQuery data into Elasticsearch using Python, allowing you to unify data from different systems for search and analysis.

You can follow along with the examples from this article in this Google Colab notebook.

Steps

  1. Prepare BigQuery
  2. Configure the BigQuery Python client
  3. Index data into Elasticsearch
  4. Search data

Prepare BigQuery

To use BigQuery, you need to go to the Google Cloud console and create a project. Once done, you will be redirected to the BigQuery interface.

BigQuery allows you to transfer data from Google Drive and Google Cloud Storage, and it also supports uploading local files. To upload data to BigQuery, you first need to create a dataset. Create one and name it "server-logs" so we can upload some files.

In this article, we will upload a local dataset of server logs. Check BigQuery's official documentation to learn how to upload local files.

Dataset

The file we will upload to BigQuery contains server log data with HTTP responses and their descriptions, in ndjson format. The ndjson file includes the following fields: ip_address, _timestamp, http_method, endpoint, status_code, response_time, and status_code_description.

BigQuery will extract the data from this file; we will then organize it with Python and index it into Elasticsearch.

Create a file named logs.ndjson and populate it with the following content:



{"ip_address": "192.168.1.3", "_timestamp": "2024-12-03T12:00:03Z", "http_method": "GET", "endpoint": "/about", "status_code": "404", "response_time": 89, "status_code_description": "The requested contact page does not exist or was removed."}
{"ip_address": "192.168.1.3", "_timestamp": "2024-12-03T12:00:07Z", "http_method": "GET", "endpoint": "/contact", "status_code": "404", "response_time": 76, "status_code_description": "The requested contact page does not exist or was removed."}
{"ip_address": "192.168.1.1", "_timestamp": "2024-12-03T12:00:01Z", "http_method": "GET", "endpoint": "/home", "status_code": "200", "response_time": 123, "status_code_description": "OK"}
{"ip_address": "192.168.1.1", "_timestamp": "2024-12-03T12:00:04Z", "http_method": "GET", "endpoint": "/products", "status_code": "200", "response_time": 156, "status_code_description": "OK"}
{"ip_address": "192.168.1.2", "_timestamp": "2024-12-03T12:00:05Z", "http_method": "GET", "endpoint": "/home", "status_code": "200", "response_time": 101, "status_code_description": "OK"}
{"ip_address": "192.168.1.2", "_timestamp": "2024-12-03T12:00:08Z", "http_method": "GET", "endpoint": "/home", "status_code": "200", "response_time": 98, "status_code_description": "OK"}
{"ip_address": "192.168.1.6", "_timestamp": "2024-12-03T12:00:10Z", "http_method": "GET", "endpoint": "/home", "status_code": "200", "response_time": 105, "status_code_description": "OK"}
{"ip_address": "192.168.1.2", "_timestamp": "2024-12-03T12:00:02Z", "http_method": "POST", "endpoint": "/login", "status_code": "500", "response_time": 340, "status_code_description": "Internal error while processing the payment gateway."}
{"ip_address": "192.168.1.5", "_timestamp": "2024-12-03T12:00:09Z", "http_method": "POST", "endpoint": "/payment", "status_code": "500", "response_time": 512, "status_code_description": "Internal error while processing the payment gateway."}
{"ip_address": "192.168.1.4", "_timestamp": "2024-12-03T12:00:06Z", "http_method": "POST", "endpoint": "/checkout", "status_code": "503", "response_time": 450, "status_code_description": "Service temporarily unavailable during the checkout process."}


We upload this file to the dataset we just created (shown as "server_logs"), using "logs" as the table name (shown as "table id").
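
If you prefer to load the file programmatically instead of through the console, here is a minimal sketch using the BigQuery Python client with schema auto-detection. The project, dataset, and table names below are assumptions matching the names used in this article; adjust them to your own setup:

from google.cloud import bigquery

client = bigquery.Client(project="elasticsearch-bigquery")

# Target table in <project>.<dataset>.<table> format (names assumed from this article)
table_id = "elasticsearch-bigquery.server_logs.logs"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the ndjson file
)

with open("logs.ndjson", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)

load_job.result()  # wait for the load job to finish
print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")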

Once uploaded, your file should now be visible in BigQuery.

Configure the BigQuery Python client

Below, we will learn how to build an application using the BigQuery Python client and Google Colab.

1. Install dependencies

First, we need to install the following dependencies:

!pip install google-cloud-bigquery elasticsearch==8.16

The google-cloud-bigquery dependency provides the tools needed to access BigQuery data, elasticsearch lets us connect to Elastic and index the data, and getpass lets us enter sensitive variables without exposing them in the code.

Let's import all the necessary dependencies:



from elasticsearch import Elasticsearch, exceptions
from google.cloud import bigquery
from google.colab import auth
from getpass import getpass
from datetime import datetime
import json


We also need to declare some additional variables and initialize the Elasticsearch Python client:



ELASTICSEARCH_ENDPOINT = getpass("Elasticsearch endpoint: ")
ELASTIC_API_KEY = getpass("Elastic Api Key: ")

# Google Cloud project name and BigQuery dataset name
PROJECT_ID = "elasticsearch-bigquery"
# dataset_id in format <your-project-name>.<your-dataset-name>
DATASET_ID = f'{PROJECT_ID}.server-logs'

# Elasticsearch client
es_client = Elasticsearch(
    ELASTICSEARCH_ENDPOINT,
    api_key=ELASTIC_API_KEY,
)


2. Authentication

To obtain the credentials required to use BigQuery, we will authenticate. Run the line below and choose the same account you used to create the Google Cloud project:

auth.authenticate_user()

Now, let's take a look at the data in BigQuery:



client = bigquery.Client(project=PROJECT_ID)
# Getting tables from dataset
tables = client.list_tables(DATASET_ID)
data = {}
for table in tables:
    # Table id must be in format <dataset_name>.<table_name>
    table_id = f"{DATASET_ID}.{table.table_id}"
    print(f"Processing table: {table.table_id}")
    # Query to retrieve BigQuery tables data
    query = f"""
        SELECT *
        FROM `{table_id}`
    """
    query_job = client.query(query)
    results = query_job.result()
    print(f"Results for table: {table.table_id}:")
    data[table.table_id] = []
    for row in results:
        # Saving data with key=table_id
        data[table.table_id].append(dict(row))
        print(row)

# variable with data
logs_data = data['logs']


This is the result you should see:



Processing table: logs
Results for table: logs:
Row(('The requested contact page does not exist or was removed.', 404, 'GET', '/about', datetime.datetime(2024, 12, 3, 12, 0, 3, tzinfo=datetime.timezone.utc), 89, '192.168.1.3'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('The requested contact page does not exist or was removed.', 404, 'GET', '/contact', datetime.datetime(2024, 12, 3, 12, 0, 7, tzinfo=datetime.timezone.utc), 76, '192.168.1.3'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('OK', 200, 'GET', '/home', datetime.datetime(2024, 12, 3, 12, 0, 1, tzinfo=datetime.timezone.utc), 123, '192.168.1.1'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('OK', 200, 'GET', '/products', datetime.datetime(2024, 12, 3, 12, 0, 4, tzinfo=datetime.timezone.utc), 156, '192.168.1.1'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('OK', 200, 'GET', '/home', datetime.datetime(2024, 12, 3, 12, 0, 5, tzinfo=datetime.timezone.utc), 101, '192.168.1.2'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('OK', 200, 'GET', '/home', datetime.datetime(2024, 12, 3, 12, 0, 8, tzinfo=datetime.timezone.utc), 98, '192.168.1.2'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('OK', 200, 'GET', '/home', datetime.datetime(2024, 12, 3, 12, 0, 10, tzinfo=datetime.timezone.utc), 105, '192.168.1.6'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('Internal error while processing the payment gateway.', 500, 'POST', '/login', datetime.datetime(2024, 12, 3, 12, 0, 2, tzinfo=datetime.timezone.utc), 340, '192.168.1.2'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('Internal error while processing the payment gateway.', 500, 'POST', '/payment', datetime.datetime(2024, 12, 3, 12, 0, 9, tzinfo=datetime.timezone.utc), 512, '192.168.1.5'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})
Row(('Service temporarily unavailable during the checkout process.', 503, 'POST', '/checkout', datetime.datetime(2024, 12, 3, 12, 0, 6, tzinfo=datetime.timezone.utc), 450, '192.168.1.4'), {'status_code_description': 0, 'status_code': 1, 'http_method': 2, 'endpoint': 3, '_timestamp': 4, 'response_time': 5, 'ip_address': 6})


With this simple piece of code, we have extracted the data from BigQuery and stored it in the logs_data variable, and we can now use it with Elasticsearch.
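
If you want to verify the structure before indexing, here is a quick sketch for inspecting one of the extracted rows. Keep in mind that _timestamp comes back from BigQuery as a Python datetime rather than a string:

# Inspect the structure of one extracted row (default=str handles the datetime field)
print(json.dumps(logs_data[0], indent=2, default=str))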

Index data into Elasticsearch

We will start by defining the data structure (mappings) for the index:



es_client.indices.create(
    index="bigquery-logs",
    body={
        "mappings": {
            "properties": {
                "status_code_description": {"type": "match_only_text"},
                "status_code": {"type": "keyword"},
                "@timestamp": {"type": "date"},
                "ip_address": {"type": "ip"},
                "http_method": {"type": "keyword"},
                "endpoint": {"type": "keyword"},
                "response_time": {"type": "integer"},
            }
        }
    }
)


The [match_only_text](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#match-only-text-field-type "match_only_text") field is a variant of the [text](https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html#text-field-type "text") field type that saves disk space by not storing the metadata used to compute scores. We use it because logs are usually time-centric, i.e. the date matters more than the quality of matches in the text field. Queries that use a text field are compatible with queries that use a match_only_text field.

We will use the Elasticsearch _bulk API to index these documents:

bulk_data = []

for log_entry in logs_data:
    # Convert timestamp to ISO 8601 string
    timestamp_iso8601 = log_entry["_timestamp"].isoformat()

    # Prepare action metadata
    action_metadata = {
        "index": {
            "_index": "bigquery-logs",
            "_id": f"{log_entry['ip_address']}-{timestamp_iso8601}"
        }
    }
    # Prepare document
    document = {
        "ip_address": log_entry["ip_address"],
        "status_code": log_entry["status_code"],
        "@timestamp": timestamp_iso8601,
        "http_method": log_entry["http_method"],
        "endpoint": log_entry["endpoint"],
        "response_time": log_entry["response_time"],
        "status_code_description": log_entry["status_code_description"]
    }
    # Append to bulk data
    bulk_data.append(action_metadata)
    bulk_data.append(document)

print(bulk_data)
# Indexing data
response = es_client.bulk(body=bulk_data)
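
After indexing, it is worth confirming that every document was accepted. A minimal sketch that checks the standard _bulk response for per-item errors:

# Check the bulk response for per-item failures
if response["errors"]:
    for item in response["items"]:
        result = item["index"]
        if result.get("error"):
            print(f"Failed to index {result['_id']}: {result['error']}")
else:
    print(f"Indexed {len(response['items'])} documents successfully")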

Search data

We can now run queries against the data in the bigquery-logs index.

In this example, we will search using the error descriptions from the server (the status_code_description field). In addition, we will sort the results by date and aggregate the IP addresses that produced errors:



es_client.search(
    index="bigquery-logs",
    body={
        "query": {"match": {"status_code_description": "error"}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "aggs": {"by_ip": {"terms": {"field": "ip_address", "size": 10}}},
    },
)


This is the result:



{
  ...
  "hits": {
    ...
    "hits": [
      {
        "_index": "bigquery-logs",
        "_id": "192.168.1.5-2024-12-03T12:00:09+00:00",
        "_score": null,
        "_source": {
          "ip_address": "192.168.1.5",
          "status_code": 500,
          "@timestamp": "2024-12-03T12:00:09+00:00",
          "http_method": "POST",
          "endpoint": "/payment",
          "response_time": 512,
          "status_code_description": "Internal error while processing the payment gateway."
        },
        "sort": [
          1733227209000
        ]
      },
      {
        "_index": "bigquery-logs",
        "_id": "192.168.1.2-2024-12-03T12:00:02+00:00",
        "_score": null,
        "_source": {
          "ip_address": "192.168.1.2",
          "status_code": 500,
          "@timestamp": "2024-12-03T12:00:02+00:00",
          "http_method": "POST",
          "endpoint": "/login",
          "response_time": 340,
          "status_code_description": "Internal error while processing the payment gateway."
        },
        "sort": [
          1733227202000
        ]
      }
    ]
  },
  "aggregations": {
    "by_ip": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "192.168.1.2",
          "doc_count": 1
        },
        {
          "key": "192.168.1.5",
          "doc_count": 1
        }
      ]
    }
  }
}
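
To process this response in Python rather than reading the raw JSON, you can iterate over the hits and the aggregation buckets. A small sketch, which repeats the search above and assigns it to a variable so the result can be traversed:

response = es_client.search(
    index="bigquery-logs",
    body={
        "query": {"match": {"status_code_description": "error"}},
        "sort": [{"@timestamp": {"order": "desc"}}],
        "aggs": {"by_ip": {"terms": {"field": "ip_address", "size": 10}}},
    },
)

# Print each matching log line, newest first
for hit in response["hits"]["hits"]:
    doc = hit["_source"]
    print(f"{doc['@timestamp']} {doc['ip_address']} {doc['endpoint']} -> {doc['status_code']}")

# Print how many error documents each IP address produced
for bucket in response["aggregations"]["by_ip"]["buckets"]:
    print(f"{bucket['key']}: {bucket['doc_count']} error(s)")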


Conclusion

Tools like BigQuery help centralize information and are very useful for data management. Beyond search, using BigQuery together with Elasticsearch lets you leverage machine learning and data analysis to detect or analyze issues in a simpler, faster way.

Want to get Elastic certified? Find out when the next Elasticsearch Engineer training runs!

Elasticsearch offers many new features to help you build the best search solution for your use case. Dive into our sample notebooks to learn more, start a free cloud trial, or try Elastic on your local machine right now.

Original article: Ingesting data with BigQuery - Elasticsearch Labs