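After six months out of work I wanted to try a career change, but I had no idea what I actually wanted to do. So I skimmed the 《职业分类大典》 (Occupational Classification Dictionary), worked out a way to turn its occupation categories into JSON, and drew a simple tree with d3. I honestly don't know how to make this chart look any better, so I stopped fiddling with it.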
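Converting the occupation categories to JSON
The heart of a chart is its content. This part isn't hard, but it is tedious. My first idea was to feed the PDF to an AI and have it generate the JSON, but of all the ones I tried, not a single one did an acceptable job. AI clearly still has a long way to go.
The PDF is easy to find; if you can't find it, use this link: 中华人民共和国职业分类大典(2022版)
So instead I had AI help me write Python code. There are three main steps: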
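- Split the PDF by chapter, mainly because the table of contents and each chapter's first page interfere with content recognition
- Extract the needed text based on font, font size, and text characteristics, and write it to txt; debugging and tweaking this script took quite a while
- Convert the txt to JSON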
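Most of this was written with AI help, so I won't keep you in suspense; here are the scripts.
Splitting the PDF
This needs one dependency (PyPDF2, installable with pip install PyPDF2), and I renamed the PDF file to input.pdf.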
import PyPDF2


def split_pdf(input_path, output_path, start_page, end_page):
    """
    Split a PDF file into a smaller PDF file.

    Args:
        input_path (str): Path of the input PDF file.
        output_path (str): Path of the output PDF file.
        start_page (int): First page to include (zero-based).
        end_page (int): Page to stop at (not included).
    """
    with open(input_path, 'rb') as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        writer = PyPDF2.PdfWriter()
        # Copy the requested page range into the writer
        for i in range(start_page, end_page):
            page = reader.pages[i]
            writer.add_page(page)
        with open(output_path, 'wb') as output_pdf:
            writer.write(output_pdf)


# One output file per chapter, skipping the table of contents and chapter title pages
split_pdf('input.pdf', 'output1.pdf', 9, 18)
split_pdf('input.pdf', 'output2.pdf', 21, 169)
split_pdf('input.pdf', 'output3.pdf', 171, 183)
split_pdf('input.pdf', 'output4.pdf', 185, 296)
split_pdf('input.pdf', 'output5.pdf', 299, 319)
split_pdf('input.pdf', 'output6.pdf', 321, 563)
split_pdf('input.pdf', 'output7.pdf', 565, 567)
split_pdf('input.pdf', 'output8.pdf', 569, 570)
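The eight calls above could also be driven from a small table of page ranges. This is only a minimal sketch, assuming the same zero-based, end-exclusive ranges; the helper name `build_split_jobs` is hypothetical, and the call to `split_pdf` is left commented out so the snippet runs without PyPDF2:

```python
# Page ranges copied from the calls above: (start_page, end_page),
# zero-based, end page excluded.
CHAPTER_RANGES = [
    (9, 18), (21, 169), (171, 183), (185, 296),
    (299, 319), (321, 563), (565, 567), (569, 570),
]


def build_split_jobs(ranges, input_path="input.pdf"):
    """Return (input_path, output_path, start, end) tuples, one per chapter.

    Hypothetical helper: output files are numbered from 1, matching the
    output1.pdf ... output8.pdf names used above.
    """
    jobs = []
    for number, (start, end) in enumerate(ranges, start=1):
        if start >= end:
            raise ValueError(f"empty page range for chapter {number}: {start}-{end}")
        jobs.append((input_path, f"output{number}.pdf", start, end))
    return jobs


jobs = build_split_jobs(CHAPTER_RANGES)
for job in jobs:
    print(job)
    # split_pdf(*job)  # uncomment once split_pdf from the script above is defined
```

Keeping the ranges in one list makes it easier to adjust them for a different edition of the PDF.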
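Extracting the text
This also needs a dependency (PyMuPDF, imported as fitz). The checks are there mainly because the PDF's headers and footers also get picked up and have to be filtered out; the rest distinguishes the content to extract by heading font and font size.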
import fitz  # PyMuPDF


def extract_bold_text_from_pdf(pdf_path, output_path):
    # Open the PDF file
    document = fitz.open(pdf_path)
    bold_texts = []
    # Iterate over every page in the document
    for page_num in range(len(document)):
        page = document.load_page(page_num)
        blocks = page.get_text("dict")["blocks"]
        # Iterate over all blocks on the page
        for block in blocks:
            # Image blocks have no "lines" key, so fall back to an empty list
            spans_arr = [line["spans"] for line in block.get("lines", [])]
            lines = []
            for spans in spans_arr:
                # Keep a span only if its font is not excluded and its size is in range
                text_arr = [
                    span["text"]
                    for span in spans
                    if span["font"]
                    not in [
                        "FZNBSK--GBK1-0",
                        "E-F6X",
                        "E-BZ",
                        "Verdana",
                        "FZKTK--GBK1-0",
                        "E-B6",
                    ]
                    and ((span["size"] > 10 and span["size"] < 13) or span["size"] > 14)
                    and span["text"] not in ["."]
                ]
                lines.extend(text_arr)
            block_text = "".join(lines)
            if block_text:
                bold_texts.append(block_text)
    # Write explicitly as UTF-8 so the Chinese text survives on any platform
    with open(output_path, "w", encoding="utf-8") as f:
        for text in bold_texts:
            # No newline after text ending in "M" or ")", so the next block
            # continues on the same line
            if text.endswith("M") or text.endswith(")"):
                f.write(text)
            else:
                f.write(text + "\n")


for i in range(1, 9):
    pdf_path = f"output{i}.pdf"  # replace with the path to your PDF file
    output_path = f'output{i}.txt'
    extract_bold_text_from_pdf(pdf_path, output_path)
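The font names and size thresholds in the filter are specific to this PDF. To find such values for another document, one option is to tally which (font, size) pairs occur. A minimal sketch: `count_font_sizes` is a hypothetical helper that takes page dictionaries shaped like the result of `page.get_text("dict")`, and the sample page below is made up for illustration, not taken from the real PDF.

```python
from collections import Counter


def count_font_sizes(page_dicts):
    """Count characters per (font, rounded size) across page dictionaries.

    Each item is expected to look like the result of PyMuPDF's
    page.get_text("dict"): {"blocks": [{"lines": [{"spans": [...]}]}]}.
    """
    counts = Counter()
    for page in page_dicts:
        for block in page.get("blocks", []):
            # Image blocks carry no "lines" key, so default to an empty list
            for line in block.get("lines", []):
                for span in line["spans"]:
                    key = (span["font"], round(span["size"], 1))
                    counts[key] += len(span["text"])
    return counts


# Made-up sample mimicking the get_text("dict") layout (not real data)
sample_page = {
    "blocks": [
        {"lines": [{"spans": [{"font": "HeadingFont", "size": 12.0, "text": "1-01"}]}]},
        {"lines": [{"spans": [{"font": "BodyFont", "size": 9.0, "text": "body text"}]}]},
        {"type": 1},  # an image block without "lines"
    ]
}

font_counts = count_font_sizes([sample_page])
for (font, size), chars in font_counts.most_common():
    print(font, size, chars)

# With PyMuPDF installed, real pages could be passed in like this:
# import fitz
# doc = fitz.open("output1.pdf")
# font_counts = count_font_sizes(page.get_text("dict") for page in doc)
```

Fonts that appear only in headers, footers, or body text would then be candidates for the exclusion list.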
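Converting txt to JSON
One step is missing here: I merged the contents of all the txt files into a single txt file (output.txt) by hand. A script could do it too, but I figured sharpening the axe would take about as long as just chopping the wood.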
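For reference, the manual merge could probably be scripted in a few lines. A minimal sketch, assuming the files are named output1.txt through output8.txt as in the extraction step and are UTF-8 encoded; `merge_txt_files` is a hypothetical helper name:

```python
from pathlib import Path


def merge_txt_files(input_paths, output_path):
    """Concatenate text files in order, keeping one entry per line."""
    merged_lines = []
    for path in input_paths:
        text = Path(path).read_text(encoding="utf-8")
        # splitlines() plus the filter drops blank lines and avoids gluing
        # the last line of one file to the first line of the next
        merged_lines.extend(line for line in text.splitlines() if line.strip())
    Path(output_path).write_text("\n".join(merged_lines) + "\n", encoding="utf-8")
    return len(merged_lines)


if __name__ == "__main__":
    parts = [f"output{i}.txt" for i in range(1, 9)]
    # Only merge files that are actually present in the working directory
    existing = [p for p in parts if Path(p).exists()]
    if existing:
        print(merge_txt_files(existing, "output.txt"), "lines written")
```

The result still deserves a quick look by eye, since any extraction glitches carry over into the merged file.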
import json
import re


def insert_node(structure, level, name):
    # The number of "-"-separated parts in the ID gives the depth (1 to 4)
    depth = len(level.split('-'))
    if depth == 1:
        structure.append({"id": level, "name": name, "children": []})
    else:
        # Walk down to the children list of the most recently added ancestor
        parent = structure
        for _ in range(1, depth):
            parent = parent[-1]["children"]
        if depth == 4:
            # Depth 4 is a leaf, so it gets no children list
            parent.append({"id": level, "name": name})
        else:
            parent.append({"id": level, "name": name, "children": []})


def convert_to_json_structure(lines):
    structure = []
    for line in lines:
        print(line)  # debug output: shows which line is being processed
        # Everything up to the last digit is the ID; the rest is the name
        matches = list(re.finditer(r"\d", line))
        if not matches:
            continue  # skip blank lines and lines without a code
        idx = matches[-1].start()
        node_id = line[:idx + 1].strip()
        name = line[idx + 1:].strip()
        insert_node(structure, node_id, name)
    return structure


# Read the merged text file
with open("output.txt", "r", encoding="utf-8") as file:
    lines = file.readlines()
# Build the nested structure
json_structure = convert_to_json_structure(lines)
# Serialize to JSON
json_output = json.dumps(json_structure, ensure_ascii=False, indent=2)
# Save the JSON output
with open("output.json", "w", encoding="utf-8") as file:
    file.write(json_output)
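To make the parsing rule concrete: each line is split at its last digit, and the number of hyphen-separated parts in the ID decides how deep the node is nested. A small standalone sketch of that rule; `split_line` is a hypothetical helper, and the sample lines use placeholder names rather than real entries from the dictionary:

```python
import re


def split_line(line):
    """Split 'ID name' at the last digit, as the converter above does.

    Returns (node_id, name, depth), or None if the line has no digit.
    """
    matches = list(re.finditer(r"\d", line))
    if not matches:
        return None
    idx = matches[-1].start()
    node_id = line[:idx + 1].strip()
    name = line[idx + 1:].strip()
    depth = len(node_id.split("-"))
    return node_id, name, depth


# Placeholder lines (not real dictionary entries)
samples = [
    "1 Major group",
    "1-01 Middle group",
    "1-01-00 Minor group",
    "1-01-00-01Occupation",
    "",
]
for sample in samples:
    print(repr(sample), "->", split_line(sample))
```

Because the split happens at the last digit, a name that itself contains a digit would be cut in the wrong place, so it is worth scanning output.json for such entries.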
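Drawing the chart
Spin up a vanilla project with Vite, install d3, and put the finished output.json in the public directory. Then go to ObservableHQ, copy the Tree component, and paste it into main.js, and you're done. You can also just use the ready-made version below.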
import './style.css';
import * as d3 from 'd3';
document.querySelector('#app').innerHTML = `
<main></main>
`;
const data = await d3.json('/output.json');
// Copyright 2021-2023 Observable, Inc.
// Released under the ISC license.
// https://observablehq.com/@d3/tree
function Tree(data, { // data is either tabular (array of objects) or hierarchy (nested objects)
  path, // as an alternative to id and parentId, returns an array identifier, imputing internal nodes
  id = Array.isArray(data) ? d => d.id : null, // if tabular data, given a d in data, returns a unique identifier (string)
  parentId = Array.isArray(data) ? d => d.parentId : null, // if tabular data, given a node d, returns its parent’s identifier
  children, // if hierarchical data, given a d in data, returns its children
  tree = d3.tree, // layout algorithm (typically d3.tree or d3.cluster)
  sort, // how to sort nodes prior to layout (e.g., (a, b) => d3.descending(a.height, b.height))
  label, // given a node d, returns the display name
  title, // given a node d, returns its hover text
  link, // given a node d, its link (if any)
  linkTarget = "_blank", // the target attribute for links (if any)
  width = 640, // outer width, in pixels
  height, // outer height, in pixels
  r = 3, // radius of nodes
  padding = 1, // horizontal padding for first and last column
  fill = "#999", // fill for nodes
  fillOpacity, // fill opacity for nodes
  stroke = "#555", // stroke for links
  strokeWidth = 1.5, // stroke width for links
  strokeOpacity = 0.4, // stroke opacity for links
  strokeLinejoin, // stroke line join for links
  strokeLinecap, // stroke line cap for links
  halo = "#fff", // color of label halo
  haloWidth = 3, // padding around the labels
  curve = d3.curveBumpX, // curve for the link
} = {}) {
  // If id and parentId options are specified, or the path option, use d3.stratify
  // to convert tabular data to a hierarchy; otherwise we assume that the data is
  // specified as an object {children} with nested objects (a.k.a. the “flare.json”
  // format), and use d3.hierarchy.
  const root = path != null ? d3.stratify().path(path)(data)
      : id != null || parentId != null ? d3.stratify().id(id).parentId(parentId)(data)
      : d3.hierarchy(data, children);
  // Sort the nodes.
  if (sort != null) root.sort(sort);
  // Compute labels and titles.
  const descendants = root.descendants();
  const L = label == null ? null : descendants.map(d => label(d.data, d));
  // Compute the layout.
  const dx = 10;
  const dy = width / (root.height + padding);
  tree().nodeSize([dx, dy])(root);
  // Center the tree.
  let x0 = Infinity;
  let x1 = -x0;
  root.each(d => {
    if (d.x > x1) x1 = d.x;
    if (d.x < x0) x0 = d.x;
  });
  // Compute the default height.
  if (height === undefined) height = x1 - x0 + dx * 2;
  // Use the required curve
  if (typeof curve !== "function") throw new Error(`Unsupported curve`);
  const svg = d3.create("svg")
      .attr("viewBox", [-dy * padding / 2, x0 - dx, width, height])
      .attr("width", width)
      .attr("height", height)
      .attr("style", "max-width: 100%; height: auto; height: intrinsic;")
      .attr("font-family", "sans-serif")
      .attr("font-size", 10);
  svg.append("g")
      .attr("fill", "none")
      .attr("stroke", stroke)
      .attr("stroke-opacity", strokeOpacity)
      .attr("stroke-linecap", strokeLinecap)
      .attr("stroke-linejoin", strokeLinejoin)
      .attr("stroke-width", strokeWidth)
    .selectAll("path")
      .data(root.links())
      .join("path")
        .attr("d", d3.link(curve)
            .x(d => d.y)
            .y(d => d.x));
  const node = svg.append("g")
    .selectAll("a")
    .data(root.descendants())
    .join("a")
      .attr("xlink:href", link == null ? null : d => link(d.data, d))
      .attr("target", link == null ? null : linkTarget)
      .attr("transform", d => `translate(${d.y},${d.x})`);
  node.append("circle")
      .attr("fill", d => d.children ? stroke : fill)
      .attr("r", r);
  if (title != null) node.append("title")
      .text(d => title(d.data, d));
  if (L) node.append("text")
      .attr("dy", "0.32em")
      .attr("x", d => d.children ? -6 : 6)
      .attr("text-anchor", d => d.children ? "end" : "start")
      .attr("paint-order", "stroke")
      .attr("stroke", halo)
      .attr("stroke-width", haloWidth)
      .text((d, i) => L[i]);
  return svg.node();
}
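// The Python script above writes a top-level array, which Tree would treat as
// tabular data and hand to d3.stratify. Wrap it in a single root object so that
// d3.hierarchy is used instead. The root name is just a placeholder.
const treeData = Array.isArray(data) ? { name: '职业分类大典', children: data } : data;
const chart = Tree(treeData, {
  label: d => d.name,
  title: (d, n) => `${n.ancestors().reverse().map(d => d.data.name).join(".")}`, // hover text
  link: (d, n) => `${n.children ? "" : ".as"}`, // left over from the Observable example
  width: 1152
});
document.querySelector('main').appendChild(chart);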
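The chart is too long, so I can only screenshot part of it.
In the end, though, I don't seem to have found anything I want to do. Let's keep each other going: we have to keep looking ahead and keep moving forward.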