Building a simple occupational classification tree chart with Python and D3

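After half a year out of work I wanted to try changing careers, but I had no idea what I actually wanted to do. So I looked through the 《职业分类大典》 (the Occupational Classification Dictionary), found a way to turn its occupation categories into JSON, and drew a simple tree with D3. I honestly do not know how to make this chart look better, so I decided not to dig into it any further.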

Converting the occupation categories to JSON

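The core of a chart is still its content. This part is not hard, but it is tedious. I first wanted to lean on AI: feed the PDF to an AI tool and let it generate the JSON for me. I tried several, and not one of them produced an acceptable result. It seems AI still has a long way to go.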

The PDF is fairly easy to find. If you cannot find it, just use this link: 中华人民共和国职业分类大典(2022版)

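After that I had AI help me write the Python code. There are three main steps:

  1. Split the PDF by chapter, mainly because the table of contents and the first page of each chapter interfere with content recognition
  2. Extract the text I need based on font, font size, and features of the text, and write it to txt; debugging and revising this script took quite a while
  3. Convert the txt to JSON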

Most of this was written with AI's help, so I won't keep you in suspense. Here are the scripts.

Splitting the PDF

This needs one dependency installed (PyPDF2). I also renamed the PDF file to input.pdf.

import PyPDF2

def split_pdf(input_path, output_path, start_page, end_page):
    """
    Split a PDF file into a smaller PDF file.

    Args:
    input_path (str): path of the input PDF file.
    output_path (str): path of the output PDF file.
    start_page (int): first page to include (0-based).
    end_page (int): page to stop at (not included).
    """
    with open(input_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        writer = PyPDF2.PdfWriter()

        for i in range(start_page, end_page):
            page = reader.pages[i]
            writer.add_page(page)

        with open(output_path, 'wb') as output_pdf:
            writer.write(output_pdf)

split_pdf('input.pdf', 'output1.pdf', 9, 18)
split_pdf('input.pdf', 'output2.pdf', 21, 169)
split_pdf('input.pdf', 'output3.pdf', 171, 183)
split_pdf('input.pdf', 'output4.pdf', 185, 296)
split_pdf('input.pdf', 'output5.pdf', 299, 319)
split_pdf('input.pdf', 'output6.pdf', 321, 563)
split_pdf('input.pdf', 'output7.pdf', 565, 567)
split_pdf('input.pdf', 'output8.pdf', 569, 570)
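
Since the ranges above were picked by hand, it can help to list which pages they leave out and confirm those really are the table of contents and chapter title pages. The following is only a sketch: skipped_pages is a hypothetical helper, not part of PyPDF2, and the total of 570 pages is an assumption taken from the largest end index used above.

```python
def skipped_pages(ranges, total_pages):
    """Return the 0-based page indices not covered by any (start, end) range.

    Each range is half-open, like the start_page/end_page arguments of
    split_pdf: start is included, end is excluded.
    """
    kept = set()
    previous_end = 0
    for start, end in ranges:
        if start < previous_end or end <= start:
            raise ValueError(f"range ({start}, {end}) overlaps or is empty")
        kept.update(range(start, end))
        previous_end = end
    return [p for p in range(total_pages) if p not in kept]


# The eight ranges passed to split_pdf above
CHAPTER_RANGES = [
    (9, 18), (21, 169), (171, 183), (185, 296),
    (299, 319), (321, 563), (565, 567), (569, 570),
]

# total_pages=570 is an assumption: it is just the largest end index used above
dropped = skipped_pages(CHAPTER_RANGES, total_pages=570)
print(len(dropped), dropped)
```

If a page with real content shows up in the printed list, the matching range needs adjusting before the extraction step.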

Extracting the text

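This also needs a dependency (PyMuPDF). The checks here exist mainly because the PDF's headers and footers also fall within the recognized area and have to be skipped; everything else is a matter of telling the wanted content apart by heading font and font size.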

import fitz  # PyMuPDF

def extract_bold_text_from_pdf(pdf_path, output_path):
    # Open the PDF file
    document = fitz.open(pdf_path)
    bold_texts = []

    # Iterate over every page in the document
    for page_num in range(len(document)):
        page = document.load_page(page_num)
        blocks = page.get_text("dict")["blocks"]

        # Iterate over all blocks on the page; image blocks have no "lines" key
        for block in blocks:
            spans_arr = [line["spans"] for line in block.get("lines", [])]
            lines = []
            for spans in spans_arr:
                text_arr = [
                    span["text"]
                    for span in spans
                    if span["font"]
                    not in [
                        "FZNBSK--GBK1-0",
                        "E-F6X",
                        "E-BZ",
                        "Verdana",
                        "FZKTK--GBK1-0",
                        "E-B6",
                    ]
                    and ((span["size"] > 10 and span["size"] < 13) or span["size"] > 14)
                    and span["text"] not in ["."]
                ]
                lines.extend(text_arr)
            block_text = "".join(lines)
            if block_text:
                bold_texts.append(block_text)
    with open(output_path, "w", encoding="utf-8") as f:
        for text in bold_texts:
            if text.endswith("M") or text.endswith(")"):
                f.write(text)
            else:
                f.write(text + "\n")



for i in range(1, 9):
    pdf_path = f"output{i}.pdf"  # replace with your PDF file path
    output_path = f'output{i}.txt'
    extract_bold_text_from_pdf(pdf_path, output_path)
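
The font names and size thresholds in the filter above are specific to this PDF. One way to find such values is to count how much text uses each font and size, then decide which pairs are headers, footers, or body text. This is a sketch only: survey_fonts and iter_spans are hypothetical helper names, and the sample spans are made up for illustration rather than taken from the real file.

```python
from collections import Counter


def survey_fonts(spans):
    """Count characters per (font, size) pair across PyMuPDF-style span dicts."""
    counts = Counter()
    for span in spans:
        text = span["text"].strip()
        if text:
            counts[(span["font"], round(span["size"], 1))] += len(text)
    return counts


def iter_spans(pdf_path):
    """Yield every text span in a PDF, using the same traversal as the script above."""
    import fitz  # PyMuPDF; imported here so survey_fonts works without it

    with fitz.open(pdf_path) as document:
        for page in document:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    yield from line["spans"]


# Made-up spans for illustration; on a real file, pass
# iter_spans("output1.pdf") to survey_fonts instead
sample_spans = [
    {"font": "HeadingFont", "size": 12.0, "text": "1-01"},
    {"font": "HeadingFont", "size": 12.0, "text": "Example heading"},
    {"font": "FooterFont", "size": 9.0, "text": "12"},
    {"font": "FooterFont", "size": 9.0, "text": "   "},
]
font_counts = survey_fonts(sample_spans)
for (font, size), chars in font_counts.most_common():
    print(font, size, chars)
```

Pairs with very little text, or with sizes outside the heading range, are the likely candidates for the exclusion list.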

Converting txt to JSON

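One step is missing here: I merged the contents of all the txt files into a single txt file (output.txt) by hand. A script could do this too, but I felt that sharpening the axe would take about as long as just chopping the wood.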
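
For anyone who prefers to script the merge, a minimal sketch could look like this. It assumes the per-chapter files are UTF-8 text and that they should simply be concatenated in order; merge_txt_files is a hypothetical helper, and the demo uses temporary files with made-up content instead of the real output1.txt to output8.txt.

```python
import tempfile
from pathlib import Path


def merge_txt_files(paths, output_path):
    """Concatenate text files in order, making sure each one ends with a newline."""
    parts = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        if text and not text.endswith("\n"):
            text += "\n"  # keep the next file's first line from being glued on
        parts.append(text)
    Path(output_path).write_text("".join(parts), encoding="utf-8")


# Demo on temporary files with made-up content
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "output1.txt")
    second = Path(tmp, "output2.txt")
    first.write_text("1 A\n1-01 B", encoding="utf-8")  # no trailing newline
    second.write_text("2 C\n", encoding="utf-8")
    merged = Path(tmp, "output.txt")
    merge_txt_files([first, second], merged)
    merged_text = merged.read_text(encoding="utf-8")
print(merged_text)
```

For the real files, the call would be merge_txt_files([f"output{i}.txt" for i in range(1, 9)], "output.txt").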

import json
import re

def insert_node(structure, level, name):
    depth = len(level.split('-'))
    if depth == 1:
        structure.append({"id": level, "name": name, "children": []})
    else:
        parent = structure
        for _ in range(1, depth):
            parent = parent[-1]["children"]
        if depth == 4:
            parent.append({"id": level, "name": name})
        else:
            parent.append({"id": level, "name": name, "children": []})


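def convert_to_json_structure(lines):
    structure = []
    for line in lines:
        # The ID ends at the last digit on the line; the rest is the name
        matches = list(re.finditer(r"\d", line))
        if not matches:
            continue  # skip blank lines and lines without an ID
        idx = matches[-1].start()
        node_id = line[:idx + 1].strip()
        name = line[idx + 1 :].strip()
        insert_node(structure, node_id, name)
    return structure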


# Read the file
with open("output.txt", "r", encoding="utf-8") as file:
    lines = file.readlines()

# Build the nested structure
json_structure = convert_to_json_structure(lines)

# Convert to JSON
json_output = json.dumps(json_structure, ensure_ascii=False, indent=2)

# Save the JSON output
with open("output.json", "w", encoding="utf-8") as file:
    file.write(json_output)
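
Because insert_node always attaches a node under the most recently added parent, a heading that went missing during extraction silently puts its children under the wrong branch. A quick check on the result can catch that. The sketch below assumes each id is a plain hyphenated code whose prefix is its parent's id; check_tree is a hypothetical helper, and the sample tree is made up.

```python
from collections import Counter


def check_tree(nodes, parent_id=None, depth=1, counts=None, orphans=None):
    """Walk a list of {"id", "name", "children"} dicts.

    Returns (counts, orphans): the number of nodes per nesting level, and
    (id, parent_id) pairs where the id does not start with its parent's id.
    """
    if counts is None:
        counts, orphans = Counter(), []
    for node in nodes:
        counts[depth] += 1
        if parent_id is not None and not node["id"].startswith(parent_id + "-"):
            orphans.append((node["id"], parent_id))
        check_tree(node.get("children", []), node["id"], depth + 1, counts, orphans)
    return counts, orphans


# Made-up tree in the shape of output.json: "1-02-00" was filed under
# "1-01", as would happen if the "1-02" line were missing from the txt
sample_tree = [
    {"id": "1", "name": "Group", "children": [
        {"id": "1-01", "name": "Subgroup", "children": [
            {"id": "1-01-00", "name": "Unit", "children": [
                {"id": "1-01-00-01", "name": "Occupation"},
            ]},
            {"id": "1-02-00", "name": "Misfiled unit", "children": []},
        ]},
    ]},
]
level_counts, orphans = check_tree(sample_tree)
print(dict(level_counts), orphans)
```

On the real data, load output.json with json.load and pass the resulting list to check_tree; an empty orphans list means every node sits under a parent with a matching prefix.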

Drawing the chart

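Spin up a vanilla project with vite, install d3, and put the finished output.json in the public directory. Then go to ObservableHQ, copy the tree component, and paste it into main.js. That is all there is to it; you can also just use the ready-made version below.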

import './style.css';
import * as d3 from 'd3';

document.querySelector('#app').innerHTML = `
  <main></main> 
`;
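// The script above writes output.json as a bare array of top-level categories.
// Tree() would treat an array as tabular data, so wrap it under a single root
// node (the root name here is only a placeholder).
const loaded = await d3.json('/output.json');
const data = Array.isArray(loaded) ? { name: '职业分类大典', children: loaded } : loaded;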
// Copyright 2021-2023 Observable, Inc.
// Released under the ISC license.
// https://observablehq.com/@d3/tree
function Tree(data, { // data is either tabular (array of objects) or hierarchy (nested objects)
  path, // as an alternative to id and parentId, returns an array identifier, imputing internal nodes
  id = Array.isArray(data) ? d => d.id : null, // if tabular data, given a d in data, returns a unique identifier (string)
  parentId = Array.isArray(data) ? d => d.parentId : null, // if tabular data, given a node d, returns its parent’s identifier
  children, // if hierarchical data, given a d in data, returns its children
  tree = d3.tree, // layout algorithm (typically d3.tree or d3.cluster)
  sort, // how to sort nodes prior to layout (e.g., (a, b) => d3.descending(a.height, b.height))
  label, // given a node d, returns the display name
  title, // given a node d, returns its hover text
  link, // given a node d, its link (if any)
  linkTarget = "_blank", // the target attribute for links (if any)
  width = 640, // outer width, in pixels
  height, // outer height, in pixels
  r = 3, // radius of nodes
  padding = 1, // horizontal padding for first and last column
  fill = "#999", // fill for nodes
  fillOpacity, // fill opacity for nodes
  stroke = "#555", // stroke for links
  strokeWidth = 1.5, // stroke width for links
  strokeOpacity = 0.4, // stroke opacity for links
  strokeLinejoin, // stroke line join for links
  strokeLinecap, // stroke line cap for links
  halo = "#fff", // color of label halo 
  haloWidth = 3, // padding around the labels
  curve = d3.curveBumpX, // curve for the link
} = {}) {

  // If id and parentId options are specified, or the path option, use d3.stratify
  // to convert tabular data to a hierarchy; otherwise we assume that the data is
  // specified as an object {children} with nested objects (a.k.a. the “flare.json”
  // format), and use d3.hierarchy.
  const root = path != null ? d3.stratify().path(path)(data)
    : id != null || parentId != null ? d3.stratify().id(id).parentId(parentId)(data)
      : d3.hierarchy(data, children);

  // Sort the nodes.
  if (sort != null) root.sort(sort);

  // Compute labels and titles.
  const descendants = root.descendants();
  const L = label == null ? null : descendants.map(d => label(d.data, d));

  // Compute the layout.
  const dx = 10;
  const dy = width / (root.height + padding);
  tree().nodeSize([dx, dy])(root);

  // Center the tree.
  let x0 = Infinity;
  let x1 = -x0;
  root.each(d => {
    if (d.x > x1) x1 = d.x;
    if (d.x < x0) x0 = d.x;
  });

  // Compute the default height.
  if (height === undefined) height = x1 - x0 + dx * 2;

  // Use the required curve
  if (typeof curve !== "function") throw new Error(`Unsupported curve`);

  const svg = d3.create("svg")
    .attr("viewBox", [-dy * padding / 2, x0 - dx, width, height])
    .attr("width", width)
    .attr("height", height)
    .attr("style", "max-width: 100%; height: auto; height: intrinsic;")
    .attr("font-family", "sans-serif")
    .attr("font-size", 10);

  svg.append("g")
    .attr("fill", "none")
    .attr("stroke", stroke)
    .attr("stroke-opacity", strokeOpacity)
    .attr("stroke-linecap", strokeLinecap)
    .attr("stroke-linejoin", strokeLinejoin)
    .attr("stroke-width", strokeWidth)
    .selectAll("path")
    .data(root.links())
    .join("path")
    .attr("d", d3.link(curve)
      .x(d => d.y)
      .y(d => d.x));

  const node = svg.append("g")
    .selectAll("a")
    .data(root.descendants())
    .join("a")
    .attr("xlink:href", link == null ? null : d => link(d.data, d))
    .attr("target", link == null ? null : linkTarget)
    .attr("transform", d => `translate(${d.y},${d.x})`);

  node.append("circle")
    .attr("fill", d => d.children ? stroke : fill)
    .attr("r", r);

  if (title != null) node.append("title")
    .text(d => title(d.data, d));

  if (L) node.append("text")
    .attr("dy", "0.32em")
    .attr("x", d => d.children ? -6 : 6)
    .attr("text-anchor", d => d.children ? "end" : "start")
    .attr("paint-order", "stroke")
    .attr("stroke", halo)
    .attr("stroke-width", haloWidth)
    .text((d, i) => L[i]);

  return svg.node();
}
const chart = Tree(data, {
  label: d => d.name,
  title: (d, n) => `${n.ancestors().reverse().map(d => d.data.name).join(".")}`, // hover text
  link: (d, n) => `${n.children ? "" : ".as"}`,
  width: 1152
});
document.querySelector('main').appendChild(chart);

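The chart is too long, so I can only screenshot part of it. [image.png: partial screenshot of the tree] I still haven't found anything I want to do, though. Let's encourage each other: we have to keep looking ahead, and we have to keep moving forward.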