Day 11: Soul Injection: Ditching the Mock for Good and Wiring In a Real LLM API



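Today's goal: fuse three earlier pieces into one: the global API credentials configured on Day 4, the real log-fetching logic from Day 6, and the execution engine from Day 10. We bring in the requests library to open a real network path from Splunk to an external LLM, and close a fully dynamic loop: "real anomaly discovery -> AI drafts a hunting plan on the spot -> automated drill-down execution -> AI delivers the final verdict"!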


💻 Architecture outline: what changes today?

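  1. Bulletproof timezone fix (The Timezone Fix): drop utcnow(), which returns a datetime with no timezone information, and switch to datetime.now(datetime.timezone.utc), which carries an explicit +00:00 offset. That way Splunk lines up timestamps correctly no matter which continent it runs on.
  2. Dynamic credential retrieval: pull the API Key, Base URL and model name from the add-on's global settings, which live in Splunk's underlying credential store.
  3. Live probe injection: a Python function that dynamically pulls the 5 rarest log patterns from the latest 1,000 events in your system, using Splunk's cluster command (the M-ATH idea).
  4. Network layer: a call_llm_api function that wraps the HTTP request and talks to the cloud endpoint for real.
  5. Two real LLM calls: one in the Prepare phase (writing the hunting blueprint) and one in the Act phase (issuing the final verdict report).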
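A quick way to see item 1 in action: the sketch below contrasts a naive datetime (what utcnow() returns) with a timezone-aware one. The timestamps are fixed, made-up values so the output is deterministic, and the UTC+8 zone is only an example.

```python
import datetime

UTC = datetime.timezone.utc

# utcnow() returns a *naive* datetime: the digits are UTC, but nothing says so,
# so a consumer may silently interpret it in its own local timezone.
naive = datetime.datetime(2026, 5, 2, 12, 5, 18)
print(naive.isoformat())   # 2026-05-02T12:05:18
print(naive.tzinfo)        # None

# The Day 11 approach: an aware datetime serializes with an explicit +00:00
aware = datetime.datetime(2026, 5, 2, 12, 5, 18, tzinfo=UTC)
print(aware.isoformat())   # 2026-05-02T12:05:18+00:00

# An aware timestamp from another zone (UTC+8, chosen as an example) converts unambiguously
utc_plus_8 = datetime.timezone(datetime.timedelta(hours=8))
local = datetime.datetime(2026, 5, 2, 20, 5, 18, tzinfo=utc_plus_8)
print(local.astimezone(UTC).isoformat())  # 2026-05-02T12:05:18+00:00

# The live call used in collect_events always ends with the UTC offset
print(datetime.datetime.now(UTC).isoformat().endswith("+00:00"))  # True
```

Ordering comparisons between naive and aware datetimes raise TypeError in Python, which is one more reason to keep every timestamp in the pipeline aware.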

💻 Hands-on: the full Day 11 code baseline

To keep the code from turning into a plate of spaghetti, we use a modular structure and add two core helper functions: fetch_rare_logs and call_llm_api.
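To make the "rare logs" idea concrete: Splunk's cluster command groups events by text similarity, and the smallest clusters are the rarest patterns. The sketch below swaps in a far cruder technique (masking digits and hex tokens to build a signature) purely to illustrate "rare = small group". It is not how cluster is implemented, and log_signature and rarest_logs are hypothetical names.

```python
import re
from collections import Counter


def log_signature(line):
    """Crude pattern key: mask hex tokens and numbers so similar lines collapse."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    return re.sub(r"\d+", "<N>", line)


def rarest_logs(lines, k=5):
    """Return one representative line for each of the k least frequent signatures."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        sig = log_signature(line)
        counts[sig] += 1
        first_seen.setdefault(sig, line)  # keep the first line seen as the representative
    # Ascending by count, loosely mirroring "sort cluster_count | head 5"
    rare = sorted(counts.items(), key=lambda kv: kv[1])[:k]
    return [first_seen[sig] for sig, _ in rare]


logs = [
    "sshd: accepted password for alice from 10.0.0.1",
    "sshd: accepted password for alice from 10.0.0.2",
    "sshd: accepted password for alice from 10.0.0.3",
    "kernel: segfault at 0x7f3a in libc",
    "cron: job 42 finished",
    "cron: job 43 finished",
]
print(rarest_logs(logs, k=2))
```

The one-off segfault surfaces first and the repetitive SSH logins drop out, which is exactly the kind of signal we want to hand to the LLM.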

Open the Define & Test editor in Add-on Builder, clear the existing code, and paste in the full code below:

import os
import sys
import time
import datetime
import json
import uuid
import requests
import splunklib.client as client
import splunklib.results as results

# ==========================================
# HELPER 1: Execute AI Generated SPL
# ==========================================
def execute_ai_spl(helper, service, spl_query):
    """
    Execute SPL generated by AI and return the raw result data.
    """
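    spl_query = (spl_query or "").strip()
    if not spl_query:
        # The LLM may omit a query; a bare "search " is not a meaningful search
        helper.log_info("[Agentic Engine] Empty SPL received. Skipping.")
        return []
    if not spl_query.startswith("search") and not spl_query.startswith("|"):
        spl_query = "search " + spl_query

    # count=0 requests all results instead of oneshot's default of 100 rows,
    # so hit counts are not silently capped (server-side limits still apply)
    kwargs_oneshot = {"output_mode": "json", "count": 0}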
    helper.log_info(f"[Agentic Engine] Executing SPL: {spl_query}")
    
    try:
        search_results = service.jobs.oneshot(spl_query, **kwargs_oneshot)
        reader = results.JSONResultsReader(search_results)
        result_data = [res for res in reader if isinstance(res, dict)]
        helper.log_info(f"[Agentic Engine] SUCCESS: Found {len(result_data)} events.")
        return result_data
    except Exception as e:
        helper.log_error(f"[Agentic Engine] FAILED execution: {str(e)}")
        return []

# ==========================================
# HELPER 2: Fetch Real Logs (M-ATH Concept)
# ==========================================
def fetch_rare_logs(helper, service, target_index):
    """
    Fetch the most recent rare/anomalous logs from the target index to feed the AI.
    """
    helper.log_info("Fetching real rare logs for analysis...")
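    # Real data: take the latest 1000 events, cluster them, keep the 5 smallest clusters.
    # showcount=t writes each cluster's size to the cluster_count field; sorting it
    # ascending surfaces the rarest patterns first.
    # Note: If 'cluster' consumes too much CPU, simplify to 'head 5'
    spl = f"search index={target_index} | head 1000 | cluster showcount=t | sort cluster_count | head 5 | table _raw"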
    
    try:
        results_data = execute_ai_spl(helper, service, spl)
        if not results_data:
            return None
        
        # Extract the _raw strings and join them into a single payload
        raw_logs = [item.get("_raw", "") for item in results_data if "_raw" in item]
        return "\n".join(raw_logs)
    except Exception as e:
        helper.log_error(f"Failed to fetch rare logs: {str(e)}")
        return None

# ==========================================
# HELPER 3: The LLM API Connector
# ==========================================
def call_llm_api(helper, api_key, base_url, model, system_prompt, user_prompt):
    """
    Send a chat-completions request to the LLM API.
    Returns a tuple (message_content, total_tokens).
    """
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }
    
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        # Force JSON output (API requires the word 'JSON' in the prompt)
        "response_format": {"type": "json_object"} 
    }
    
    # Ensure URL ends correctly
    endpoint = base_url if base_url.endswith("/chat/completions") else f"{base_url.rstrip('/')}/chat/completions"
    
    try:
        helper.log_info(f"Initiating network request to LLM API: {endpoint}")
        # Increased timeout to 120s for complex reasoning models
        response = requests.post(endpoint, headers=headers, json=payload, timeout=120)
        response.raise_for_status() 
        
        response_json = response.json()
        llm_content = response_json["choices"][0]["message"]["content"]
        
        # Log token usage for FinOps tracking
        total_tokens = response_json.get("usage", {}).get("total_tokens", 0)
        helper.log_info(f"API Call Success. Consumed {total_tokens} tokens.")
        
        return llm_content, total_tokens
        
    except requests.exceptions.RequestException as e:
        helper.log_error(f"Network error during API call: {str(e)}")
        raise

# ==========================================
# MAIN WORKFLOW: The Autonomous Agent
# ==========================================
def collect_events(helper, ew):
    """
    Day 11: Real API Integration Workflow with Timezone Fix.
    """
    helper.log_info("PEAK AI Hunter: LIVE MODE INITIALIZED.")
    cycle_start_time = time.time()
    hunt_session_id = str(uuid.uuid4())

    try:
        # 1. Acquire Splunk Service Session
        session_key = getattr(helper, 'session_key', None) or getattr(helper._input_definition, 'metadata', {}).get('session_key')
        if not session_key:
            raise ValueError("Failed to acquire session_key.")
        service = client.Service(token=session_key)
        
        # 2. Acquire Global Setup Configurations (from Day 4)
        api_key = helper.get_global_setting("api_key")
        base_url = helper.get_global_setting("base_url")
        model_name = helper.get_global_setting("model_name")
        target_index = helper.get_output_index() or "main"
        
        # THE FIX: Generate timezone-aware UTC timestamp (e.g., 2026-05-02T12:05:18.142200+00:00)
        timestamp_now = datetime.datetime.now(datetime.timezone.utc).isoformat()

        if not api_key or not base_url or not model_name:
            raise ValueError("API Key, Base URL or Model Name is missing in Global Settings.")

        # ==========================================
        # PHASE 1: PREPARE (Real LLM Call for Blueprint)
        # ==========================================
        rare_logs_payload = fetch_rare_logs(helper, service, target_index)
        if not rare_logs_payload:
            helper.log_info("No anomalous logs found to analyze. Terminating cycle early gracefully.")
            return

        sys_prompt_prepare = "You are a Senior Threat Hunter. You MUST reply in JSON format. Output strictly valid JSON. Schema requires: 'analysis' (string) and 'hypotheses' (array of objects). Each hypothesis must have 'hypothesis_id', 'ABLE' (Actor, Behavior, Location, Evidence), 'spl_round_1_validation', and 'spl_round_2_drilldown'."
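        # {{target_index}} renders as the literal placeholder {target_index}, which Phase 2 replaces
        usr_prompt_prepare = f"Analyze these real, rare logs from our environment:\n{rare_logs_payload}\n\nGenerate exactly 2 hunting hypotheses to investigate them. Write efficient Splunk SPL for the drill-downs, using the literal placeholder {{target_index}} as the index name in every query (e.g. index={{target_index}}). Output only JSON format."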

        helper.log_info("Triggering LLM for Prepare Phase...")
        blueprint_text, prep_tokens = call_llm_api(helper, api_key, base_url, model_name, sys_prompt_prepare, usr_prompt_prepare)
        
        ai_hunting_plan = json.loads(blueprint_text.strip())
        hypotheses = ai_hunting_plan.get("hypotheses", [])

        # Write Plan to Splunk IMMEDIATELY
        ew.write_event(helper.new_event(
            source=helper.get_input_type(), index=target_index, sourcetype="_json",
            data=json.dumps({"session_id": hunt_session_id, "event_type": "PEAK_Plan", "timestamp": timestamp_now, "content": ai_hunting_plan}, ensure_ascii=False)
        ))

        # ==========================================
        # PHASE 2: EXECUTE (Agentic Splunk Query Loop)
        # ==========================================
        all_hunt_evidence = []
        for i, hyp in enumerate(hypotheses):
            hyp_start = time.time()
            spl_r1 = hyp.get("spl_round_1_validation", "").replace("{target_index}", target_index)
            spl_r2 = hyp.get("spl_round_2_drilldown", "").replace("{target_index}", target_index)
            
            r1_hits = len(execute_ai_spl(helper, service, spl_r1))
            r2_hits = len(execute_ai_spl(helper, service, spl_r2))
            
            all_hunt_evidence.append({
                "hypothesis_id": hyp.get("hypothesis_id", i+1),
                "threat_behavior": hyp.get('ABLE', {}).get('Behavior', 'Unknown'),
                "round_1_hit_count": r1_hits,
                "round_2_hit_count": r2_hits,
                "execution_duration_sec": round(time.time() - hyp_start, 2)
            })

        # Write Evidence to Splunk IMMEDIATELY
        ew.write_event(helper.new_event(
            source=helper.get_input_type(), index=target_index, sourcetype="_json",
            data=json.dumps({"session_id": hunt_session_id, "event_type": "PEAK_Evidence", "timestamp": timestamp_now, "content": all_hunt_evidence}, ensure_ascii=False)
        ))

        # ==========================================
        # PHASE 3: ACT (Real LLM Call for Final Report)
        # ==========================================
        sys_prompt_act = "You are a Security Director. You MUST reply in JSON format. Output ONLY valid JSON with keys: 'executive_summary', 'threat_qualification' (Benign/Suspicious/Confirmed), 'risk_score' (0-100), 'recommended_alert_spl'."
        usr_prompt_act = f"Here is the quantitative execution evidence collected by our agent:\n{json.dumps(all_hunt_evidence)}\n\nBased on these hit counts, qualify the threat, assign a risk score, and generate an alert SPL. Reply in JSON format."

        helper.log_info("Triggering LLM for Act Phase...")
        report_text, act_tokens = call_llm_api(helper, api_key, base_url, model_name, sys_prompt_act, usr_prompt_act)
        
        try:
            final_report = json.loads(report_text.strip())
        except json.JSONDecodeError as e:
            helper.log_error(f"Act Phase returned invalid or truncated JSON ({e}). Engaging fallback.")
            final_report = {"executive_summary": "LLM output could not be parsed as JSON.", "risk_score": -1, "raw": report_text}

        # Write Final Report to Splunk
        ew.write_event(helper.new_event(
            source=helper.get_input_type(), index=target_index, sourcetype="_json",
            data=json.dumps({"session_id": hunt_session_id, "event_type": "PEAK_Final_Report", "timestamp": timestamp_now, "total_tokens_used": prep_tokens + act_tokens, "content": final_report}, ensure_ascii=False)
        ))

        helper.log_info(f"LIVE CYCLE COMPLETE. Time: {round(time.time() - cycle_start_time, 2)}s. Session ID: {hunt_session_id}")

    except Exception as e:
        helper.log_error(f"FATAL Pipeline Crash: {str(e)}")
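The Prepare phase feeds the model output straight into json.loads. Some models still wrap their JSON in a Markdown code fence even with response_format set. A minimal defensive parser, assuming that failure mode, could look like this; parse_llm_json is a hypothetical helper and not part of the baseline above.

```python
import json
import re

TICKS = "`" * 3  # built this way so the pattern text contains no literal fence

# An optional Markdown fence, with an optional "json" tag, around the payload
FENCE_RE = re.compile("^" + TICKS + r"(?:json)?\s*(.*?)\s*" + TICKS + "$", re.DOTALL)


def parse_llm_json(text):
    """Parse an LLM reply as JSON, tolerating a surrounding Markdown code fence.

    Raises json.JSONDecodeError if the unwrapped text is still not valid JSON.
    """
    cleaned = text.strip()
    match = FENCE_RE.match(cleaned)
    if match:
        cleaned = match.group(1)  # keep only what is inside the fence
    return json.loads(cleaned)


wrapped = TICKS + 'json\n{"risk_score": 42, "threat_qualification": "Suspicious"}\n' + TICKS
print(parse_llm_json(wrapped)["risk_score"])  # 42
```

Swapping json.loads(blueprint_text.strip()) for parse_llm_json(blueprint_text) is a judgment call: with JSON mode working as intended the fence should never appear, so this is just cheap insurance.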

🔍 Geek verification: the moment of truth

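We are now wired into a real cloud API and have cleared the surrounding obstacles: the timeout is raised to 120 seconds, and every prompt explicitly asks for JSON, as the API's JSON mode requires. We have also fixed the core timezone skew, so this test should run very smoothly!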
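Both of those obstacles can be checked offline, before any network call. A sketch, assuming an OpenAI-compatible /chat/completions API; build_chat_request is a hypothetical name, and the "json" keyword check reflects OpenAI's documented JSON-mode rule, which other vendors may or may not enforce.

```python
from urllib.parse import urlparse


def build_chat_request(base_url, model, system_prompt, user_prompt):
    """Validate inputs and return (endpoint, payload) for a chat-completions call."""
    # requests raises MissingSchema for a bare host such as "api.openai.com/v1"
    if urlparse(base_url).scheme not in ("http", "https"):
        raise ValueError(f"Base URL must start with http:// or https://: {base_url!r}")

    # JSON mode expects the conversation itself to ask for JSON
    if "json" not in (system_prompt + user_prompt).lower():
        raise ValueError("JSON mode: mention 'JSON' in the system or user prompt")

    # Same normalization as call_llm_api: append /chat/completions only if missing
    if base_url.endswith("/chat/completions"):
        endpoint = base_url
    else:
        endpoint = base_url.rstrip("/") + "/chat/completions"

    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "response_format": {"type": "json_object"},
    }
    return endpoint, payload


endpoint, payload = build_chat_request(
    "https://api.openai.com/v1/", "gpt-4o", "Reply in JSON.", "Summarize these logs."
)
print(endpoint)  # https://api.openai.com/v1/chat/completions
```

On the timeout side, requests also accepts a (connect, read) tuple, e.g. timeout=(10, 120). That fails fast on an unreachable host while still giving slow reasoning models the full read window.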

Steps:

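  1. Pre-test check: make sure that in the Day 4 AOB screen (Configuration -> Add-on Setup Parameters) you have actually filled in your API Key, your Base URL including the https:// scheme (for example https://api.openai.com/v1 or any vendor's OpenAI-compatible address), and your model name (for example gpt-4o or qwen-plus).
  2. Run the test: click Test in the top-right corner of the code editor. This time the wait is noticeably longer (perhaps 15-30 seconds), because the code is making real HTTP calls to the cloud LLM!
  3. Save the result: once the test finishes and you see the green Done, click Save right away.
  4. Collect the results: open Splunk's Search page and run the query below; a time range of Last 15 minutes is plenty. If your input writes to an index other than main, change index=main accordingly.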
index=main sourcetype="_json" (event_type="PEAK_Plan" OR event_type="PEAK_Evidence" OR event_type="PEAK_Final_Report")
| stats 
    latest(content.risk_score) as Risk_Score,
    latest(content.executive_summary) as Summary,
    sum(content{}.round_1_hit_count) as Total_R1_Hits,
    sum(content{}.round_2_hit_count) as Total_R2_Hits
  by session_id
| sort - Risk_Score
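If the stats line is hard to read, here is a rough pure-Python analogue of what it computes per session_id, run against hand-written sample events. The field names match what collect_events writes. Processing events oldest to newest so that later values win is an assumption standing in for latest(), and summarize_sessions is a hypothetical name.

```python
from collections import defaultdict


def summarize_sessions(events):
    """Rough Python analogue of the stats ... by session_id query above."""
    rows = defaultdict(lambda: {"Risk_Score": None, "Summary": None,
                                "Total_R1_Hits": 0, "Total_R2_Hits": 0})
    # Events are assumed to arrive oldest first, so later values overwrite earlier ones
    for ev in events:
        row = rows[ev["session_id"]]
        content = ev["content"]
        if ev["event_type"] == "PEAK_Final_Report":
            row["Risk_Score"] = content.get("risk_score")
            row["Summary"] = content.get("executive_summary")
        elif ev["event_type"] == "PEAK_Evidence":
            # content is a list; SPL's content{}.field flattens it into a multivalue field
            row["Total_R1_Hits"] += sum(h.get("round_1_hit_count", 0) for h in content)
            row["Total_R2_Hits"] += sum(h.get("round_2_hit_count", 0) for h in content)
    # Like "sort - Risk_Score": highest first; sessions without a score go last here
    return sorted(rows.items(),
                  key=lambda kv: (kv[1]["Risk_Score"] is None, -(kv[1]["Risk_Score"] or 0)))


sample_events = [
    {"session_id": "a", "event_type": "PEAK_Plan", "content": {"hypotheses": []}},
    {"session_id": "a", "event_type": "PEAK_Evidence", "content": [
        {"round_1_hit_count": 3, "round_2_hit_count": 1},
        {"round_1_hit_count": 2, "round_2_hit_count": 0},
    ]},
    {"session_id": "a", "event_type": "PEAK_Final_Report",
     "content": {"risk_score": 40, "executive_summary": "low"}},
    {"session_id": "b", "event_type": "PEAK_Final_Report",
     "content": {"risk_score": 85, "executive_summary": "high"}},
]
for session_id, row in summarize_sessions(sample_events):
    print(session_id, row["Risk_Score"], row["Total_R1_Hits"], row["Total_R2_Hits"])
```

One difference from Splunk: stats sum() returns null when a session has no evidence values, while this sketch reports 0.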

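🎉 Milestone: if the query returns results and the summary analyzes real logs from your own machine, then the security AI agent you built by hand can now think on its own, issue commands and gather evidence automatically!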

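Now only one question remains: what happens if today's anomalous logs add up to 100,000 characters and we stuff all of them into the LLM? That is exactly the problem tomorrow's Day 12: Dynamic Context Refinement and Protection will tackle. Get ready for some real enterprise-grade storms!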