[OpenHarmony] HarmonyOS Development: jsoup


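Introduction

jsoup is a fast, forgiving HTML parser that can:

  • Fetch and parse HTML from a URL, a file, or a string;
  • Convert an HTML document into a DOM structure, from which element attributes and text can be extracted;
  • Manipulate HTML elements, attributes, and text;
  • Clean user-submitted HTML, keeping only allow-listed elements and, for each element, only its allow-listed attributes;
  • Output tidy HTML or XHTML.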

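Download and installation

Install the package that matches the feature you need:

Scenario 1: HTML operations (parsing, extracting from, and cleaning HTML documents)

ohpm install @ohos/sanitize-html

Scenario 2: converting HTML into tidy XHTML

ohpm install @ohos/htmltoxml

Scenario 3: converting HTML into JSON

ohpm install parser-html-json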

Usage

HTML operations

Parsing HTML and extracting element attributes and text

  • Configure GlobalContext in src/main/ets/entryability/EntryAbility.ts
  GlobalContext.getContext().setValue("resManager", this.context.resourceManager);
  GlobalContext.getContext().setValue("filesPath", this.context.filesDir);
  GlobalContext.getContext().setValue("context", this.context);
  • Create a Partial<Handler> in helper.ts
import type { Parser } from "htmlparser2";
import { Handler } from 'htmlparser2/src/main/ets/esm/Parser';

interface Event {
    $event: string;
    data: unknown[];
    startIndex: number;
    endIndex: number;
}

/**
 * Creates a handler that calls the supplied callback with simplified events on
 * completion.
 *
 * @internal
 * @param callback Function to call with all events.
 */
export function getEventCollector(
    callback: (error: Error | null, data?: ESObject) => void,
): Partial<Handler> {
    const events: Event[] = [];
    let parser: Parser;

    function handle(event: string, data: unknown[]): void {
        switch (event) {
            case "onerror": {
                callback(data[0] as Error);
                break;
            }
            case "onend": {
                callback(null, {
                    $event: event.slice(2),
                    startIndex: parser.startIndex,
                    endIndex: parser.endIndex,
                    data,
                });
                break;
            }
            case "onreset": {
                events.length = 0;
                break;
            }
            case "onparserinit": {
                parser = data[0] as Parser;
                break;
            }

            case "onopentag": {
                callback(null, {
                    $event: event.slice(2),
                    startIndex: parser.startIndex,
                    endIndex: parser.endIndex,
                    data,
                });
                break;
            }

            case "ontext": {
                callback(null, {
                    $event: event.slice(2),
                    startIndex: parser.startIndex,
                    endIndex: parser.endIndex,
                    data: data[0],
                })
                break;
            }

            case "onclosetag": {
                if (data[0] === "script") {
                    console.info("htmlparser2--That's it?!");
                }
                break;
            }
            default: {
                const last = events[events.length - 1];
                if (event === "ontext" && last && last.$event === "text") {
                    (last.data[0] as string) += data[0];
                    last.endIndex = parser.endIndex;
                    break;
                }

                if (event === "onattribute" && data[2] === undefined) {
                    data.pop();
                }

                if (!(parser.startIndex <= parser.endIndex)) {
                    throw new Error(
                        `Invalid start/end index ${parser.startIndex} > ${parser.endIndex}`,
                    );
                }

                events.push({
                    $event: event.slice(2),
                    startIndex: parser.startIndex,
                    endIndex: parser.endIndex,
                    data,
                });
            }
        }
    }

    return new Proxy(
        {},
        {
            get:
            (_, event: string) =>
            (...data: unknown[]) =>
            handle(event, data),
        },
    );
}
  • Build a Parser with the Handler
import { Parser } from 'htmlparser2'
import * as helper from './helper' // assumed location of the helper.ts created above

let parser = new Parser(helper.getEventCollector((error, actual: ESObject) => {
    if (actual.$event == "opentag") {
        this.addLog(this.parserContent, `jsoup-- onopentag name --> ${actual.data[0]}  attributes --> ${JSON.stringify(actual.data[1])}`);
    }
    if (actual.$event == "text") {
        this.addLog(this.parserContent, "jsoup-- text -->" + actual.data);
    }
    if (actual.$event == "opentagname") {
        this.addLog(this.parserContent, "jsoup-- tagName -->" + actual.data);
    }
    if (actual.$event == "attribute") {
        this.addLog(this.parserContent, `jsoup-- attribName name --> ${actual.data[0]}  value --> ${actual.data[1]}`);
    }
    if (actual.$event == "closetag") {
        this.addLog(this.parserContent, "jsoup-- closeTag --> " + actual.data);
    }
    if (actual.$event == "end") {
        this.showResult(this.parserContent.join('\n'))
        this.parserContent = [];
    }
}));
parser.write(html);
parser.end();
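
The event collector in helper.ts works because a Proxy `get` trap can return a function for any property name, so every `on...` callback the parser looks up is routed through a single place. The sketch below isolates that mechanism without htmlparser2; `makeCollector` and `RecordedEvent` are hypothetical names used only for illustration.

```typescript
// One recorded callback invocation.
interface RecordedEvent {
  name: string;
  args: unknown[];
}

type AnyHandler = Record<string, (...args: unknown[]) => void>;

// Returns an object on which *any* method name is callable.
// Each call is recorded with the leading "on" stripped, as helper.ts does with slice(2).
function makeCollector(sink: RecordedEvent[]): AnyHandler {
  return new Proxy({} as AnyHandler, {
    get: (_target, prop: string | symbol) =>
      (...args: unknown[]): void => {
        sink.push({ name: String(prop).slice(2), args });
      },
  });
}

const recorded: RecordedEvent[] = [];
const collector = makeCollector(recorded);

// Simulate the calls a parser would make while reading <a href="#">hello</a>.
collector.onopentag('a', { href: '#' });
collector.ontext('hello');
collector.onclosetag('a');

console.log(recorded.map((e) => e.name).join(',')); // opentag,text,closetag
```

The benefit of this design is that the handler does not need one method per event; the cost is that typos in event names are not caught by the type checker.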

  • Build a Parser with DomHandler
import { Parser } from 'htmlparser2'
import { DomHandler } from 'domhandler'
import * as DomUtils from 'domutils'

const handler = new DomHandler((error, dom) => {
  if (error) {
    // Handle error
  } else {
    // Parsing completed, do something
    console.info('jsoup dom.toString()=' + dom + "");
    let elements = DomUtils.getElementsByTagName('style', dom)
    console.info('jsoup elements.length=', elements.length);
    let element = elements[0]
    console.info('jsoup element=', Object.keys(element));
    let text = DomUtils.getText(elements)
    console.info('jsoup text=', text); 
  }
});
const parser = new Parser(handler, { decodeEntities: true });
parser.write(html);
parser.end();

  • Parse with parseDocument
import { parseDocument } from 'htmlparser2'
import { Document } from 'domhandler'
import * as DomUtils from 'domutils'

let dom: Document = parseDocument(html)
// Use DomUtils to work with the parsed DOM object
// Get elements by tag name
let element = DomUtils.getElementsByTagName('style', dom)
// Get the text content
let text = DomUtils.getText(element)
// Check whether the node is a tag
let isTag = DomUtils.isTag(element[0])
// Check whether the node is CDATA
let isCDATA = DomUtils.isCDATA(element[0])
// Check whether the node is text
let isText = DomUtils.isText(element[0])
// Check whether the node is a comment
let isComment = DomUtils.isComment(element[0])
// Get the children of a given element (here, the first <body> element)
let body = DomUtils.getElementsByTagName('body', dom)
let childrens = DomUtils.getChildren(body[0])
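
To make the tag lookup and text extraction above more concrete, here is a minimal sketch of the same two operations on a hand-built tree. It is not the domutils implementation; `SketchNode`, `findByTag`, and `textOf` are hypothetical names, and the node shape is deliberately simplified.

```typescript
interface TextNode { type: 'text'; data: string }
interface TagNode { type: 'tag'; name: string; children: SketchNode[] }
type SketchNode = TextNode | TagNode;

// Depth-first search for every tag node with the given name.
function findByTag(name: string, nodes: SketchNode[]): TagNode[] {
  const found: TagNode[] = [];
  for (const node of nodes) {
    if (node.type === 'tag') {
      if (node.name === name) found.push(node);
      found.push(...findByTag(name, node.children));
    }
  }
  return found;
}

// Concatenates the text of all descendants, in document order.
function textOf(nodes: SketchNode[]): string {
  return nodes
    .map((node) => (node.type === 'text' ? node.data : textOf(node.children)))
    .join('');
}

const tree: SketchNode[] = [
  {
    type: 'tag', name: 'html', children: [
      { type: 'tag', name: 'style', children: [{ type: 'text', data: 'p{color:red}' }] },
      { type: 'tag', name: 'p', children: [{ type: 'text', data: 'Hi' }] },
    ],
  },
];

const styles = findByTag('style', tree);
console.log(styles.length);   // 1
console.log(textOf(styles));  // p{color:red}
```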

Getting the HTML text

  • Fetch the HTML text from a URL
import http from '@ohos.net.http';

let httpRequest = http.createHttp()
httpRequest.request('http://106.15.92.248/share/html.txt')
  .then((data) => {
    console.log("jsoup url html=" + JSON.stringify(data))
    // TODO do something
    if (data.result && typeof data.result === 'string') {
      parser.write(data.result);
      parser.end();
    }
  })
  .catch((err) => {
    console.error('jsoup connect error:' + JSON.stringify(err));
  })

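  • Read the HTML text from a file stream
import fileio from '@ohos.fileio';

// `stream` is assumed to be an already opened fileio stream for the HTML file
let buf = new ArrayBuffer(html.length)
stream.readSync(buf, {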
  offset: 0, length: html.length, position: 0
})
let dom = String.fromCharCode.apply(null, new Uint8Array(buf))
// TODO  do something
parser.write(dom);
parser.end();
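
One caveat with the stream example: building a string with one character per byte is only correct for single-byte (ASCII) content, and a buffer sized by character count can be shorter than the file's byte length. The sketch below illustrates the difference on a small UTF-8 byte sequence. It uses the standard global `TextDecoder` as a stand-in for the `util.TextDecoder` shown in the rawfile example; the byte values are illustrative.

```typescript
// UTF-8 bytes of the markup "<p>你好</p>" (each Chinese character takes three bytes).
const bytes = new Uint8Array([
  0x3c, 0x70, 0x3e,       // <p>
  0xe4, 0xbd, 0xa0,       // 你
  0xe5, 0xa5, 0xbd,       // 好
  0x3c, 0x2f, 0x70, 0x3e, // </p>
]);

// One character per byte: the multi-byte characters come out garbled.
const perByte = String.fromCharCode(...Array.from(bytes));

// Decoding the bytes as UTF-8 restores the original characters.
const decoded = new TextDecoder('utf-8').decode(bytes);

console.log(perByte.length);            // 13
console.log(decoded.length);            // 9
console.log(decoded === '<p>你好</p>'); // true
```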

  • Read the HTML text from a rawfile
import util from '@ohos.util';

// Note: this value must first be set in EntryAbility (see the GlobalContext configuration above):
let resourceManager = GlobalContext.getContext().getValue("resManager") as resmgr.ResourceManager
if (!resourceManager) {
  console.log('jsoup resourceManager is undefined');
  return;
}
resourceManager.getRawFile(filePath)
  .then((data) => {
    var textDecoder = new util.TextDecoder("utf-8", {
      ignoreBOM: true
    })
    var result: string = textDecoder.decode(data, {
      stream: false
    })
    // TODO do something
    parser.write(result);
    parser.end();
  })
  .catch((err) => {
    console.log("jsoup getHtmlFromRawFile err=" + err)
  })
  • Read the HTML text from a file path
 import fileio from '@ohos.fileio';

 let filesPath = GlobalContext.getContext()
            .getValue("filesPath") as string

 if (!filesPath) {
   console.log('jsoup filesPath is undefined');
   return;
 }
 var filePath = filesPath + '/jsoup.html';
 fileio.readText(filePath)
   .then((data) => {
     console.log("jsoup getHtmlFromFilePath text=" + data);
     // TODO do something
     parser.write(data);
     parser.end();
   })
   .catch((err) => {
     console.log("jsoup getHtmlFromFilePath err=" + err)
   })

Cleaning HTML and manipulating HTML elements, attributes, and text

  • Import the module
import SanitizeHtml from 'sanitize-html'
  • Clean HTML

Use the default lists of allowed tags and attributes:

const clean = SanitizeHtml(dirty);

Allow specific tags and attributes so that they are not removed:

const clean = SanitizeHtml(dirty, {
 allowedTags: [ 'b', 'i', 'em', 'strong', 'a' ],
 allowedAttributes: {
   'a': [ 'href' ]
 },
 allowedIframeHostnames: ['www.youtube.com']
});

Add tags on top of the default list:

const clean = SanitizeHtml(dirty, {
  allowedTags: SanitizeHtml.defaults.allowedTags.concat([ 'img' ])
});

Escape disallowed tags instead of removing them:

const clean = SanitizeHtml('before <img src="test.png" /> after', {
 disallowedTagsMode: 'escape',
 allowedTags: [],
 allowedAttributes: false
})
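
As a rough illustration of what "escaping" a tag means, the sketch below replaces the markup characters with entities so that the tag is shown as text instead of being parsed. `escapeMarkup` is a hypothetical helper, not part of sanitize-html, and the library's exact output (for example its handling of quotes) may differ.

```typescript
// Replace the characters that start or end markup with their HTML entities.
// "&" is handled first so that the entities produced below are not escaped again.
function escapeMarkup(html: string): string {
  return html
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

const escaped = escapeMarkup('<img src="test.png" />');
console.log(escaped); // &lt;img src="test.png" /&gt;
```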

Allow all tags or all attributes:

allowedTags: false,
allowedAttributes: false

Allow no tags at all:

allowedTags: [],
allowedAttributes: {}

Allow specific CSS classes on specific elements:

const clean = SanitizeHtml(dirty, {
  allowedTags: [ 'p', 'em', 'strong' ],
  allowedClasses: {
    'p': [ 'fancy', 'simple' ]
  }
});

Allow specific CSS styles on specific elements:

const clean = SanitizeHtml(dirty, {
  allowedTags: ['p'],
  allowedAttributes: {
    'p': ["style"],
  },
  allowedStyles: {
    '*': {
      // Match HEX and RGB
      'color': [/^#(0x)?[0-9a-f]+$/i, /^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$/],
      'text-align': [/^left$/, /^right$/, /^center$/],
      // Match any number with px, em, or %
      'font-size': [/^\d+(?:px|em|%)$/]
    },
    'p': {
      'font-size': [/^\d+rem$/]
    }
  }
});
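
To see which values the patterns in the allowedStyles example accept, they can be tested directly. The sketch below only exercises the regular expressions; it does not call sanitize-html, and `matchesAny` is a hypothetical helper.

```typescript
// Patterns copied from the allowedStyles example.
const colorPatterns: RegExp[] = [
  /^#(0x)?[0-9a-f]+$/i,
  /^rgb\(\s*(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})\s*\)$/,
];
const fontSizePatterns: RegExp[] = [/^\d+(?:px|em|%)$/];
const paragraphFontSizePatterns: RegExp[] = [/^\d+rem$/];

// A value passes when at least one pattern matches it.
function matchesAny(patterns: RegExp[], value: string): boolean {
  return patterns.some((pattern) => pattern.test(value));
}

console.log(matchesAny(colorPatterns, '#ff0000'));          // true
console.log(matchesAny(colorPatterns, 'rgb(255, 0, 0)'));   // true
console.log(matchesAny(colorPatterns, 'red'));              // false
console.log(matchesAny(fontSizePatterns, '12px'));          // true
console.log(matchesAny(fontSizePatterns, '2rem'));          // false
console.log(matchesAny(paragraphFontSizePatterns, '2rem')); // true
```

Note that named colors such as `red` are rejected by these patterns, so a stylesheet that relies on them would need an extra pattern.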
  • Change tags
const dirty='<ol><li>Hello world</li></ol>';
const clean = SanitizeHtml(dirty, {
 transformTags: {
   'ol': 'ul',
 }
});

Change a tag and add attributes:

const dirty = '<ol foo="foo" bar="bar" baz="baz"><li>Hello world</li></ol>';
const clean = SanitizeHtml(dirty, {
  transformTags: { ol: SanitizeHtml.simpleTransform('ul', { class: 'foo' }) },
  allowedAttributes: { ul: ['foo', 'bar', 'class'] }
});
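
A transformer is simply a function from a tag name and its attributes to a new tag description. The sketch below shows what a helper like `simpleTransform` could look like, written from its type signature in the API reference. `simpleTransformSketch` is a hypothetical stand-in, not the library source, and treating `merge` as true by default is an assumption.

```typescript
interface Attributes { [attr: string]: string }
interface Tag { tagName: string; attribs: Attributes; text?: string }
type Transformer = (tagName: string, attribs: Attributes) => Tag;

// Builds a transformer that renames the tag and either merges the new
// attributes into the existing ones (merge = true) or replaces them.
function simpleTransformSketch(
  newTagName: string,
  newAttribs: Attributes,
  merge: boolean = true, // assumed default
): Transformer {
  return (_tagName, attribs) => ({
    tagName: newTagName,
    attribs: merge ? { ...attribs, ...newAttribs } : { ...newAttribs },
  });
}

const toUl = simpleTransformSketch('ul', { class: 'foo' });
const transformed = toUl('ol', { foo: 'foo', bar: 'bar', baz: 'baz' });

console.log(transformed.tagName);                        // ul
console.log(Object.keys(transformed.attribs).join(',')); // foo,bar,baz,class
```

In the example above, the `allowedAttributes` entry for `ul` lists only `foo`, `bar`, and `class`, so the transformed tag's attributes are still subject to that allow-list.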
  • Add or modify the text content of a tag
const clean = SanitizeHtml(dirty, {
  transformTags: {
    'a': function(tagName, attribs) {
      return {
        tagName: 'a',
        attribs: attribs,
        text: 'Some text'
      };
    }
  }
});

For example, you can transform a link element that has no anchor text:

<a href="http://somelink.com"></a>

into a link with anchor text:

<a href="http://somelink.com">Some text</a>
  • Use a filter function to remove unwanted tags
const dirty = '<p>This is <a href="http://www.linux.org"></a><br/>Linux</p>';
const clean = SanitizeHtml(dirty, {
              exclusiveFilter: function (frame) {
                return frame.tag === 'a' && !frame.text.trim();
              }
            });
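
The filter is an ordinary predicate over a frame, so it can be checked on its own. The sketch below applies the same condition to a few hand-written frames; the `Frame` shape follows the `IFrame` interface in the API reference, and the sample frames (including their `tagPosition` values) are made up for illustration.

```typescript
interface Frame {
  tag: string;
  attribs: { [index: string]: string };
  text: string;
  tagPosition: number;
}

// Same condition as the exclusiveFilter above: true means "remove this tag".
const isEmptyLink = (frame: Frame): boolean =>
  frame.tag === 'a' && !frame.text.trim();

const frames: Frame[] = [
  { tag: 'a', attribs: { href: 'http://www.linux.org' }, text: '', tagPosition: 1 },
  { tag: 'a', attribs: { href: 'http://www.linux.org' }, text: 'Linux', tagPosition: 1 },
  { tag: 'p', attribs: {}, text: '   ', tagPosition: 0 },
];

const removed = frames.map(isEmptyLink);
console.log(removed.join(',')); // true,false,false
```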


Converting HTML into tidy XHTML

import { XMLWriter } from '@ohos/htmltoxml'
let property = [{ key: XMLWriter.DOCTYPE_PUBLIC, value: '-//W3C//DTD XHTML 1.1//EN' },
  { key: XMLWriter.DOCTYPE_SYSTEM, value: 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd' }]
const xml = new XMLWriter(html, property);
xml.convertToXML((content, error) => {

})
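
The callback body above is left empty. One way to fill it is to separate the success and failure paths, as in the sketch below. It assumes only the callback shape `(content, error)` listed in the API reference; `makeConvertCallback` is a hypothetical helper, and the two calls at the end simulate what a converter would do.

```typescript
type ConvertCallback = (content: string | null, error?: Error) => void;

// Builds a callback that routes the result to onSuccess or onFailure.
function makeConvertCallback(
  onSuccess: (xhtml: string) => void,
  onFailure: (error: Error) => void,
): ConvertCallback {
  return (content, error) => {
    if (error || content === null) {
      onFailure(error ?? new Error('conversion returned no content'));
      return;
    }
    onSuccess(content);
  };
}

const log: string[] = [];
const callback = makeConvertCallback(
  (xhtml) => log.push(`ok:${xhtml}`),
  (error) => log.push(`error:${error.message}`),
);

// Simulated invocations; a real converter would call the callback itself.
callback('<p>hi</p>');
callback(null, new Error('bad input'));

console.log(log.join(' | ')); // ok:<p>hi</p> | error:bad input
```

With the converter from the example, the callback would be passed as `xml.convertToXML(callback)`.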

Extracting CSS

import * as ParserHTMLJson from 'parser-html-json'

let parserJson = new ParserHTMLJson.default(html);
let result = parserJson.getClassStyleJson();
console.info("jsoup css=" + JSON.stringify(result));

API reference

Type definitions:

 // Parser handler callbacks
 interface Handler {
   onparserinit(parser: Parser): void;
   onreset(): void;
   onend(): void;
   onerror(error: Error): void;
   onclosetag(name: string): void;
   onopentagname(name: string): void;
   onattribute(name: string, value: string, quote?: string | undefined | null): void;
   onopentag(name: string, attribs: {
       [s: string]: string;
   }): void;
   ontext(data: string): void;
   oncomment(data: string): void;
   oncdatastart(): void;
   oncdataend(): void;
   oncommentend(): void;
   onprocessinginstruction(name: string, data: string): void;
 }

 // Parser options
 interface ParserOptions {
   decodeEntities?: boolean;
   lowerCaseTags?: boolean;
   lowerCaseAttributeNames?: boolean;
   recognizeCDATA?: boolean;
 }

 // Clean HTML to defend against XSS attacks
 declare namespace sanitize {
  interface Attributes { [attr: string]: string; }

  interface Tag { tagName: string; attribs: Attributes; text?: string ; }

  type Transformer = (tagName: string, attribs: Attributes) => Tag;

  type AllowedAttribute = string | { name: string; multiple?: boolean ; values: string[] };

  type DisallowedTagsModes = 'discard' | 'escape' | 'recursiveEscape';

  interface IDefaults {
    allowedAttributes: Record<string, AllowedAttribute[]>;
    allowedSchemes: string[];
    allowedSchemesByTag: { [index: string]: string[] };
    allowedSchemesAppliedToAttributes: string[];
    allowedTags: string[];
    allowProtocolRelative: boolean;
    disallowedTagsMode: DisallowedTagsModes;
    enforceHtmlBoundary: boolean;
    selfClosing: string[];
  }

  interface IFrame {
    tag: string;
    attribs: { [index: string]: string };
    text: string;
    tagPosition: number;
  }

  interface IOptions {
    allowedAttributes?: Record<string, AllowedAttribute[]> | false;
    allowedStyles?: { [index: string]: { [index: string]: RegExp[] } } ;
    allowedClasses?: { [index: string]: boolean | Array<string | RegExp> };
    allowIframeRelativeUrls?: boolean ;
    allowedSchemes?: string[] | boolean ;
    allowedSchemesByTag?: { [index: string]: string[] } | boolean ;
    allowedSchemesAppliedToAttributes?: string[] ;
    allowProtocolRelative?: boolean ;
    allowedTags?: string[] | false ;
    allowVulnerableTags?: boolean ;
    textFilter?: ((text: string, tagName: string) => string) ;
    exclusiveFilter?: ((frame: IFrame) => boolean) ;
    nonTextTags?: string[] ;
    selfClosing?: string[] ;
    transformTags?: { [tagName: string]: string | Transformer } ;
    parser?: ParserOptions ;
    disallowedTagsMode?: DisallowedTagsModes ;
    enforceHtmlBoundary?: boolean ;
  }

  const defaults: IDefaults;
  const options: IOptions;

  function simpleTransform(tagName: string, attribs: Attributes, merge?: boolean): Transformer;
  }

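API definitions:

| Method | Parameters | Description |
|---|---|---|
| new Parser(cbs: Partial<Handler> \| null, options?: ParserOptions) | handler, ParserOptions | Creates an HTML parser. |
| write(chunk: string): void | string | Writes data to the HTML parser: parses a chunk of data and invokes the corresponding callbacks. |
| end(chunk?: string): void | string | Parses the end of the buffer, clears the stack, and calls onend. |
| parseComplete(data: string): void | string | Resets the parser, then parses a complete document and pushes it to the handler. |
| parseDocument(data: string, options?: ParserOptions): Document | string, ParserOptions | Parses the data and returns the resulting document. |
| SanitizeHtml(dirty: string, options?: sanitize.IOptions): string | string, sanitize.IOptions | Cleans HTML so that it can be trusted. |
| new XMLWriter(html: string, property?: Array) | string, Array | Creates an XHTML converter object. |
| convertToXML(callback: (content: string \| null, error?: Error) => void): void | callback | Converts HTML into XHTML. |
| new ParserHTMLJson.default(html: string) | html | Creates an HTML-to-JSON parser. |
| getClassStyleJson() | (none) | Extracts the CSS. |
| getHtmlJson() | (none) | Returns the HTML as a JSON-format string. |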

Constraints and limitations

Verified with the following versions:

DevEco Studio: 4.1 Canary (4.1.3.317), OpenHarmony SDK: API 11 (4.1.0.36)

Directory structure

|---- jsoup  
|     |---- entry  # Sample code folder
|        |----src/main/ets
|            |pages
|                |----addTag.ets
|                |----index.ets
|                |----showResult.ets
|     |---- library # Library that converts HTML into XHTML
|     |---- README.md  # Installation and usage guide
|     |---- README_zh.md  # Installation and usage guide (Chinese)