ChineseBERT

A different kind of embedding: pinyin and glyph. Extending the idea further: emoji and homophone embeddings.

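This is especially well suited to detecting black-market (spam and fraud) text, such as homophone and synonym variants like these: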

Aa01A01= qq 扣扣 QQ 企鹅 抠抠 ＱＱ 🐧🐧 叩叩 扣抠 q，q 企鹅号 球球
Aa01A02= 微信 ⅴ信 weixin 违心 wx 威信 微杏 魏❤ 薇❤ 威星 魏信 ⅴ╳ 卫⭐ 微号 微信号 w信 威信号 违心 微x 薇心 薇辛
Aa01A03= 进群 +群 加群 加峮 +裙 加裙 来群 +裙 进峮 家裙 +扣扣裙 ➕抠裙 +扣裙 加企鹅群 抠郡
Aa01A04= +q +Q 加q 加扣 加抠 加🐧 +🐧 加我q ➕q 加企鹅号 ➕🐧 ➕q，q
Aa01A05= mai号 卖号 卖hao 麦号 maihao mài号 在线出售号 荬hao 荬号 脉号 mai账号
Aa01A06= 租号 zu号 zuhao 租账号 zu账号 出租此号 租账号
Aa01A07= 佳v 佳个v +v
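
One way to use variant groups like the ones above is to collapse every variant to a single canonical term before the text reaches a model. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and treating the first term after `=` as the canonical form is my assumption, not something the notes specify.

```python
import re


def build_variant_map(lines):
    """Parse lines like 'Aa01A01= qq 扣扣 QQ' into {variant: canonical}."""
    mapping = {}
    for line in lines:
        # Everything after '=' is a whitespace-separated list of terms.
        _, _, words = line.partition("=")
        terms = words.split()
        if len(terms) < 2:
            continue  # no '=' or nothing to map
        canonical = terms[0]  # assumption: the first term is the canonical form
        for variant in terms[1:]:
            if variant != canonical:
                mapping[variant] = canonical
    return mapping


def normalize_variants(text, mapping):
    """Replace every known variant in text with its canonical term."""
    if not mapping:
        return text
    # Try longer variants first so a long variant wins over a shorter overlap.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = "|".join(re.escape(v) for v in ordered)
    return re.sub(pattern, lambda m: mapping[m.group()], text)


groups = [
    "Aa01A01= qq 扣扣 QQ 企鹅 抠抠",
    "Aa01A02= 微信 weixin 违心 wx 威信 微信号",
]
variant_map = build_variant_map(groups)
print(normalize_variants("加我扣扣或者威信", variant_map))  # 加我qq或者微信
```

Plain substring replacement is crude: a short variant such as `wx` also matches inside unrelated words, so a real pipeline would probably need context checks on top of this.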

0   零  〇  〇  ⁰  ₀  ０
1  一  壹  ①  ¹  ₁  １
2  二  贰  ②  ²  ₂  ２
3  三  叁  ③  ³  ₃  ３
4  四  肆  ④  ⁴  ₄  ４
5  五  伍  ⑤  ⁵  ₅  ５
6  六  陆  ⑥  ⁶  ₆  ６
7  七  柒  ⑦  ⁷  ₇  ７
8  八  捌  ⑧  ⁸  ₈  ８
9  九  玖  ⑨  ⁹  ₉  ９
加  ＋  佳  +  嘉  家  伽  迦  珈  茄  颊  痂  茄  ➕  jia
群  裙  峮  qun
卖  麦  荬  脉  迈

# Desensitized, obfuscated samples
+个卫⭐15１916216 
+个吗：扣抠7613１36    
+他v一3125₇377⁸ 
+企鹅号群八⑦零八五八 
+佳伽1347三⁶伍12 
+妹妹抠抠,陆₆98₁753    
+微信①⑨⑨②③⑥①⑤③ 
+我:3²⑨1２壹6  
+我QQ6453985  
+我q26⁰6₃₇0₉  
+我q3436⁰0６5〇  
+我q贰³9八887   
+我q，q18叁３685 
+我q，q6⁹15361给你玩  
+我q，q吧，4〇⁹62柒4   
+我q，q老婆贰2３06陆⑤7²  
+我v13〇¹730171 
+我zzzz5881   
+我一8771壹427
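
The digit table above maps directly onto a character-level normalization step. The sketch below is illustrative only: the names `DIGIT_ROWS` and `normalize_digits` are hypothetical, and the rows are copied from the table (canonical ASCII digit first, variants after).

```python
# Each row: canonical ASCII digit first, then the variants from the digit table.
DIGIT_ROWS = [
    "0零〇⁰₀０",
    "1一壹①¹₁１",
    "2二贰②²₂２",
    "3三叁③³₃３",
    "4四肆④⁴₄４",
    "5五伍⑤⁵₅５",
    "6六陆⑥⁶₆６",
    "7七柒⑦⁷₇７",
    "8八捌⑧⁸₈８",
    "9九玖⑨⁹₉９",
]

# Translation table that maps every variant character to its canonical digit.
DIGIT_TABLE = str.maketrans(
    {variant: row[0] for row in DIGIT_ROWS for variant in row[1:]}
)


def normalize_digits(text):
    """Rewrite digit look-alikes (Chinese numerals, circled, super/subscript,
    full-width) as plain ASCII digits."""
    return text.translate(DIGIT_TABLE)


print(normalize_digits("+微信①⑨⑨②③⑥①⑤③"))  # +微信199236153
print(normalize_digits("+企鹅号群八⑦零八五八"))  # +企鹅号群870858
```

Note that this blindly converts characters such as 一 and 二 even inside ordinary words, so it is probably best applied only to spans already suspected of carrying an account number.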

Thoughts on ResNet

Would skip connections that span multiple layers work even better? That would fuse shallow and deep features.

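BFG: a tool for slimming down Git repositories

Scenario:

Sooner or later an intern who is not yet comfortable with Git commits a large model or dataset to the repository. The repository balloons to several GB or even tens of GB, cloning becomes painfully slow, and productivity suffers.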

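Options:

Option 1: Many online tutorials, such as blog.csdn.net/luchengtao1…, use Git commands to locate large files and then delete them. This is very slow, and in unusual cases it may still fail after a lot of fiddling. Not recommended.

Option 2: BFG. Use it without hesitation; it runs silky smooth.

If I must name a drawback, there is one: it needs a Java environment, which is a little extra work, but time spent sharpening the axe is never wasted.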

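Things to watch out for when using BFG:

Note 1: The latest contents of the latest branch are not cleaned (BFG protects the most recent commit by default); see the official site for the reasons.

Note 2: The git push that has to follow the cleanup may not succeed. The error looks like this: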

remote: GitLab: You are not allowed to force push code to a protected branch on this project.

Unprotect the branch first, then push again.
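
For reference, the BFG workflow described above can be scripted. This is a sketch, assuming the usual sequence from the BFG documentation (mirror clone, strip large blobs, expire the reflog and garbage-collect, push). The function names, repository URL, jar path, and size threshold are hypothetical placeholders.

```python
import subprocess


def bfg_cleanup_commands(repo_url, mirror_dir, bfg_jar="bfg.jar", max_blob_size="100M"):
    """Return the (argv, cwd) steps of a typical BFG cleanup, in order."""
    return [
        # 1. Work on a bare mirror clone, not on a normal working copy.
        (["git", "clone", "--mirror", repo_url, mirror_dir], None),
        # 2. Let BFG rewrite history, dropping blobs above the size threshold.
        (["java", "-jar", bfg_jar, "--strip-blobs-bigger-than", max_blob_size, mirror_dir], None),
        # 3. Physically remove the now-unreferenced objects.
        (["git", "reflog", "expire", "--expire=now", "--all"], mirror_dir),
        (["git", "gc", "--prune=now", "--aggressive"], mirror_dir),
        # 4. Push the rewritten history (fails while the branch is protected).
        (["git", "push"], mirror_dir),
    ]


def run_bfg_cleanup(repo_url, mirror_dir, **kwargs):
    """Execute the steps one by one, stopping at the first failure."""
    for argv, cwd in bfg_cleanup_commands(repo_url, mirror_dir, **kwargs):
        subprocess.run(argv, cwd=cwd, check=True)


# Dry run: only print the commands (the URL below is a placeholder).
for argv, cwd in bfg_cleanup_commands("git@example.com:team/big-repo.git", "big-repo.git"):
    print(" ".join(argv), f"(in {cwd})" if cwd else "")
```

Because the history is rewritten, everyone else should re-clone after the push instead of pulling into an old checkout.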

What is `one-hot`
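
As a minimal illustration (not taken from the original notes): one-hot encoding represents each of N categories as a length-N vector with a single 1 at that category's index and 0 everywhere else. The function name and the sample labels below are hypothetical.

```python
def one_hot(labels):
    """Return (vocabulary, vectors): one 0/1 vector per input label."""
    vocab = sorted(set(labels))                      # fixed category order
    index = {label: i for i, label in enumerate(vocab)}
    vectors = []
    for label in labels:
        vec = [0] * len(vocab)
        vec[index[label]] = 1                        # exactly one position is "hot"
        vectors.append(vec)
    return vocab, vectors


vocab, vectors = one_hot(["spam", "ham", "spam"])
print(vocab)    # ['ham', 'spam']
print(vectors)  # [[0, 1], [1, 0], [0, 1]]
```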

Removing runs of consecutive emoji

# Meaningless runs of consecutive symbols
# Assumes item_str holds the text being cleaned and emoji_count is a running counter.
import re
import emoji

# NOTE: emoji.get_emoji_regexp() is only available in older releases of the
# emoji package (before 2.0).
flags = re.search(emoji.get_emoji_regexp().pattern + '{2,}', item_str)
if flags:
    emoji_run = flags.group()
    # Keep a single copy of each repeated emoji, in first-seen order:
    # example: ✊✊✊✊✋✋✋✋ ---> ✊✋
    new_emoji = ''.join(dict.fromkeys(emoji_run))
    emoji_count += 1
    new_str = item_str.replace(emoji_run, new_emoji)
    # print(new_str, '\n')
    str_all_remove = remove_all_special_symbols(new_str)
    if len(str_all_remove) == 0 and not any(
            emoji_item in new_str for emoji_item in ['❌', '⭕', '🖕🏼', '🐔']):
        # Drop texts that consist only of emoji or special symbols
        print(item_str)
        print('all negative:', new_str)
        print('remove:', item_str, '\n')

Removing all special symbols

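def remove_all_special_symbols(content):
    """Keep only CJK ideographs, ASCII digits and lowercase ASCII letters."""
    result = ''
    for char in content:
        i2 = ord(char)
        if (0x4e00 <= i2 <= 0x9fa5) or \
                (0x3400 <= i2 <= 0x4db5) or \
                (0x0030 <= i2 <= 0x0039) or \
                (0x0061 <= i2 <= 0x007a):
            # CJK Unified Ideographs, CJK Extension A, digits 0-9, letters a-z.
            # Uppercase letters are dropped; lowercase the input first to keep them.
            result += char
    return result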

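Setting up a Go + TensorFlow + GPU environment

Getting this environment working took a whole day, which was frustrating...

The biggest pitfall: following the official instructions simply does not work...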

package main

import (
        "fmt"
        tg "github.com/galeone/tfgo"
        tf "github.com/galeone/tensorflow/tensorflow/go"
)

func main() {
        root := tg.NewRoot()
        A := tg.NewTensor(root, tg.Const(root, [2][2]int32{{1, 2}, {-1, -2}}))
        x := tg.NewTensor(root, tg.Const(root, [2][1]int64{{10}, {100}}))
        b := tg.NewTensor(root, tg.Const(root, [2][1]int32{{-10}, {10}}))
        Y := A.MatMul(x.Output).Add(b.Output)
        // Please note that Y is just a pointer to A!

        // If we want to create a different node in the graph, we have to clone Y
        // or equivalently A
        Z := A.Clone()
        results := tg.Exec(root, []tf.Output{Y.Output, Z.Output}, nil, &tf.SessionOptions{})
        fmt.Println("Y: ", results[0].Value(), "Z: ", results[1].Value())
        fmt.Println("Y == A", Y == A) // ==> true
        fmt.Println("Z == A", Z == A) // ==> false
}

root@98d4e01d6441:~/test# go run main.go
2021-06-02 03:31:17.394266: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-02 03:31:17.425760: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-02 03:31:17.425997: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-06-02 03:31:17.426492: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-06-02 03:31:18.147130: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-02 03:31:18.147571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-06-02 03:31:18.147607: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-02 03:31:18.147897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:02:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-06-02 03:31:18.147926: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-02 03:31:18.148214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:05:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-06-02 03:31:18.148243: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-06-02 03:31:18.148527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:06:00.0 name: GeForce RTX 3080 computeCapability: 8.6
coreClock: 1.74GHz coreCount: 68 deviceMemorySize: 9.78GiB deviceMemoryBandwidth: 707.88GiB/s
2021-06-02 03:31:18.148539: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-06-02 03:31:18.150121: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-06-02 03:31:18.150139: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-06-02 03:31:18.150620: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-06-02 03:31:18.150743: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-06-02 03:31:18.150868: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-06-02 03:31:18.151220: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-06-02 03:31:18.151305: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-06-02 03:31:18.151313: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-06-02 03:31:18.151334: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-02 03:31:18.151340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1 2 3
2021-06-02 03:31:18.151346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N N N N
2021-06-02 03:31:18.151349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   N N N N
2021-06-02 03:31:18.151353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 2:   N N N N
2021-06-02 03:31:18.151357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 3:   N N N N
2021-06-02 03:31:18.151983: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3000000000 Hz
Y:  [[200] [-200]] Z:  [[200] [-200]]
Y == A true
Z == A false

Note that in the log above libcusolver.so.10 could not be loaded, so TensorFlow skipped registering the GPU devices and this particular run executed on the CPU.

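A first look at NudeNet, an NSFW detection model

Two main features:

Image classification: distinguishes pornographic images from non-pornographic ones
Object detection:

Whether sensitive body parts are exposed
Can detect buttocks, belly, breasts, genitalia, and armpits
Can detect whether underwear is being worn
Can distinguish male from female

GitHub: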

github.com/notAI-tech/…

Test code:

import os
import random

import matplotlib.pyplot as plt
import numpy as np
from nudenet import NudeDetector
from PIL import Image, ImageDraw, ImageFont

# In a Jupyter notebook, run `%matplotlib inline` first so figures render inline.

# initialize detector (downloads the checkpoint file automatically the first time)
detector = NudeDetector()  # detector = NudeDetector('base') for the "base" version of detector.
data_dir = './test_images/small_data/train_data/sexy/'
image_list = os.listdir(data_dir)
font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSerif-Bold.ttf", 50)

# Detect single image (original note: /home/xin/.nudenet)
# Run the detector on 10 randomly chosen images
random.shuffle(image_list)
for image_name in image_list[:10]:
    image_path = os.path.join(data_dir, image_name)
    result = detector.detect(image_path)
    im = Image.open(image_path)
    draw = ImageDraw.Draw(im)
    plt.figure(figsize=(8, 6), dpi=80)
    print('--------------------------')

    # Draw every detected box and its label onto the image
    for item in result:
        box = item['box']
        score = item['score']
        label = item['label'].lower()
        draw.rectangle(box, width=5)
        # draw.text(box[:2], 'score:\n' + str(score) + '\nlabel:\n' + label, font=font, fill=(255, 0, 0, 255))
        draw.text(box[:2], label, font=font, fill=(255, 0, 0, 255))
        print(label)

    plt.imshow(np.asarray(im))
    plt.show()
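
The detection results used above are lists of dicts with `box`, `score`, and `label` keys, so they are easy to post-process without touching the model. The sketch below keeps only confident "exposed" detections. It assumes that exposure labels start with `EXPOSED` (my assumption about the label naming, so check the labels your NudeNet version actually emits); the function name, the threshold, and the sample data are hypothetical.

```python
def exposed_detections(result, min_score=0.5):
    """Keep detections whose label starts with 'exposed' (case-insensitive)
    and whose score is at least min_score."""
    return [
        item for item in result
        if item['label'].lower().startswith('exposed') and item['score'] >= min_score
    ]


# Hypothetical detector output, shaped like the result of detector.detect(...)
sample_result = [
    {'box': [10, 20, 110, 220], 'score': 0.91, 'label': 'EXPOSED_BELLY'},
    {'box': [30, 40, 90, 100], 'score': 0.88, 'label': 'FACE_F'},
    {'box': [50, 60, 70, 80], 'score': 0.30, 'label': 'EXPOSED_ARMPITS'},
]

flagged = exposed_detections(sample_result)
print([item['label'] for item in flagged])  # ['EXPOSED_BELLY']
```

An image could then be flagged when this list is non-empty, with `min_score` tuned on labeled data.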

Development experience notes: September