12 - Data Quality Engineering for Large Models


Opening: Data Quality Sets the Ceiling for Your Model

In the era of large language models, one saying keeps proving itself true: "Garbage in, garbage out."

Real-World Cases: The Outsized Impact of Data Quality

Case 1: The leap from GPT-3 to GPT-3.5

GPT-3 (2020):
  Pretraining data: 300B tokens (mostly CommonCrawl)
  Result: strong comprehension, but frequently produced meaningless content

GPT-3.5 (2022):
  Pretraining data: similar scale, but rigorously filtered
  + high-quality instruction data (InstructGPT)
  Result: a qualitative leap; usability improved dramatically ✅

The key difference was not the amount of data, but its quality!

Case 2: The secret behind LLaMA's success

Meta's LLaMA paper revealed:

Model     | Data volume | Data quality strategy    | Performance
OPT-175B  | 180B tokens | Standard cleaning        | Baseline
LLaMA-65B | 1.4T tokens | Strict filtering + dedup | Beats OPT ✅

With a smaller model (65B vs. 175B) trained on more high-quality data, LLaMA achieved better results.

Case 3: The disaster of low-quality data

Scenario: training a chat model on Reddit data

Raw data (unfiltered):
  "Hello" → "Get lost"
  "What should I do?" → "Not my problem"
  "Thanks" → "Idiot"

The model after training:
  User: "Hello"
  Model: "Get lost, stop bothering me" ❌

The problem: the model learned toxic, hostile expressions.

Quantifying the Impact of Data Quality

Experimental data (a 7B model trained on data of varying quality):

Data type              | Data volume | MMLU  | HumanEval | AlpacaEval | Training cost
Raw crawled data       | 1T tokens   | 45.2% | 12.3%     | 52.1%      | 100%
Basic cleaning         | 800B tokens | 52.8% | 18.5%     | 65.3%      | 80%
High-quality filtering | 500B tokens | 61.5% | 28.7%     | 78.9%      | 50%

Observations

  • Data volume cut by 50%
  • Training cost halved
  • Relative performance up roughly 36-133%, depending on the benchmark

Conclusion: quality > quantity!
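The relative gains claimed above can be checked with a few lines of Python; the figures are taken directly from the table:

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent, e.g. 45.2 -> 61.5 is about +36%."""
    return (after - before) / before * 100

# Raw crawled data vs. high-quality filtering (numbers from the table above)
print(round(relative_gain(45.2, 61.5), 1))  # MMLU: 36.1
print(round(relative_gain(12.3, 28.7), 1))  # HumanEval: 133.3
print(round(relative_gain(52.1, 78.9), 1))  # AlpacaEval: 51.4
```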


Part 1: The Six Dimensions of Data Quality

Dimension 1: Accuracy

Definition: the correctness and truthfulness of the data's content.

Problem examples

# Incorrect facts
"The Eiffel Tower is in London, England"  ❌
"1 + 1 = 3"  ❌
"Python was invented in 1995"  ❌ (it was 1991)

# Correct facts
"The Eiffel Tower is in Paris, France"
"1 + 1 = 2"
"Python was invented in 1991"

Detection method

import re

def check_factual_accuracy(text):
    """A simple rule-based fact check."""
    errors = []

    # Rule 1: check basic arithmetic
    math_pattern = r'(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)'
    for match in re.finditer(math_pattern, text):
        a, b, result = map(int, match.groups())
        if a + b != result:
            errors.append(f"Math error: {match.group()}")

    # Rule 2: check common factual errors
    false_facts = [
        (r'Eiffel Tower.*London', 'The Eiffel Tower is in Paris, not London'),
        (r"China.*capital.*Shanghai", "China's capital is Beijing, not Shanghai"),
    ]

    for pattern, error_msg in false_facts:
        if re.search(pattern, text):
            errors.append(error_msg)

    return errors

# Test
text = "The Eiffel Tower is in London, and 2+2=5"
errors = check_factual_accuracy(text)
print(f"Found {len(errors)} errors:")
for error in errors:
    print(f"  - {error}")

Practical impact

Data accuracy | Model hallucination rate | Trustworthiness
60%           | 45%                      | Low ❌
80%           | 25%                      | Medium ⚠️
95%           | 8%                       | High ✅
99%           | 3%                       | Very high ✅✅

Dimension 2: Completeness

Definition: whether the data contains all necessary information.

Problem examples

# Incomplete data
incomplete_data = {
    "instruction": "Summarize this article",
    "input": "",  # ❌ the article text is missing
    "output": "This article discusses..."
}

# Complete data
complete_data = {
    "instruction": "Summarize this article",
    "input": "Artificial intelligence is... [full article]",  # ✅ necessary information included
    "output": "This article discusses the development of artificial intelligence..."
}

Detection method

def check_completeness(sample):
    """Check a sample for completeness."""
    issues = []

    # Check required fields
    required_fields = ['instruction', 'input', 'output']
    for field in required_fields:
        if field not in sample:
            issues.append(f"Missing field: {field}")
        elif not sample[field] or not str(sample[field]).strip():
            issues.append(f"Empty field: {field}")

    # Check length
    if 'output' in sample and len(sample['output']) < 10:
        issues.append("Output too short; may be incomplete")

    # Check for truncation
    if sample.get('output', '').endswith('...'):
        issues.append("Output may be truncated")

    return issues

# Test
samples = [
    {"instruction": "Translate", "input": "Hello", "output": "Bonjour"},
    {"instruction": "Summarize", "input": "", "output": "The article describes..."},
    {"instruction": "Write", "input": "Write an essay", "output": "Title: ..."}
]

for i, sample in enumerate(samples):
    issues = check_completeness(sample)
    if issues:
        print(f"Sample {i} issues: {issues}")

Dimension 3: Consistency

Definition: uniformity of format and semantics across the dataset.

Problem examples

# Inconsistent formats
inconsistent_data = [
    {"question": "What is AI?", "answer": "Artificial intelligence is..."},
    {"query": "What's AI?", "response": "..."},  # ❌ different field names
    {"q": "Define AI", "a": "..."}  # ❌ abbreviated field names
]

# Semantic inconsistency
semantic_inconsistency = [
    {
        "input": "Weather in Beijing",
        "output": "It's sunny in Shanghai today"  # ❌ answers a different question
    },
    {
        "input": "What is 1+1?",
        "output": "Python is a programming language"  # ❌ irrelevant answer
    }
]

Detection method

import re
from collections import Counter

def check_consistency(dataset):
    """Check a dataset for consistency."""
    issues = []

    # 1. Field-name consistency
    field_names = [set(sample.keys()) for sample in dataset]
    field_counter = Counter(frozenset(fields) for fields in field_names)

    if len(field_counter) > 1:
        issues.append(f"Found {len(field_counter)} different field combinations")
        print("Field combinations:")
        for fields, count in field_counter.most_common():
            print(f"  {set(fields)}: {count} samples")

    # 2. Format consistency (date formats as an example)
    date_formats = []
    for sample in dataset:
        text = str(sample)
        if re.search(r'\d{4}-\d{2}-\d{2}', text):
            date_formats.append('YYYY-MM-DD')
        elif re.search(r'\d{2}/\d{2}/\d{4}', text):
            date_formats.append('MM/DD/YYYY')

    if len(set(date_formats)) > 1:
        issues.append(f"Inconsistent date formats: {set(date_formats)}")

    return issues

# Test
mixed_dataset = [
    {"question": "Q1", "answer": "A1"},
    {"query": "Q2", "response": "A2"},
    {"q": "Q3", "a": "A3"}
]

issues = check_consistency(mixed_dataset)
print(f"Consistency issues: {issues}")

Dimension 4: Diversity

Definition: the breadth of topics, styles, and difficulty levels the data covers.

Problem examples

# Low diversity (every sample is simple addition)
low_diversity = [
    {"input": "1+1", "output": "2"},
    {"input": "2+2", "output": "4"},
    {"input": "3+3", "output": "6"},
    # ... 1000 similar samples
]
# Problem: the model only learns simple addition, not multiplication or division

# High diversity
high_diversity = [
    {"input": "1+1", "output": "2"},  # addition
    {"input": "3×4", "output": "12"},  # multiplication
    {"input": "Solve x²=9", "output": "x=±3"},  # equations
    {"input": "Calculus: ∫x dx", "output": "x²/2+C"},  # advanced math
]

Evaluation method

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def measure_diversity(texts):
    """Measure the diversity of a list of texts."""
    # 1. Lexical diversity (type-token ratio)
    all_words = []
    for text in texts:
        all_words.extend(text.lower().split())

    vocabulary_size = len(set(all_words))
    total_words = len(all_words)
    lexical_diversity = vocabulary_size / total_words

    # 2. Semantic diversity (based on TF-IDF similarity)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(texts)

    # Average pairwise similarity
    similarities = []
    n = len(texts)
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(tfidf_matrix[i], tfidf_matrix[j])[0][0]
            similarities.append(sim)

    avg_similarity = np.mean(similarities)
    semantic_diversity = 1 - avg_similarity  # lower similarity = higher diversity

    return {
        'lexical_diversity': lexical_diversity,
        'semantic_diversity': semantic_diversity,
        'vocabulary_size': vocabulary_size
    }

# Test
low_div_texts = [
    "the weather is nice today",
    "the weather is really nice today",
    "the weather is very nice today"
]

high_div_texts = [
    "the weather is nice today",
    "an introduction to quantum physics",
    "how to cook delicious pasta"
]

print("Low diversity:", measure_diversity(low_div_texts))
print("High diversity:", measure_diversity(high_div_texts))

How diversity affects the model

Diversity score  | MMLU | Generalization | Overfitting risk
Low (<0.3)       | 45%  | Poor           | High ❌
Medium (0.3-0.6) | 58%  | Moderate       | Moderate
High (>0.6)      | 67%  | Good           | Low ✅
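Alongside the type-token ratio and TF-IDF similarity used above, distinct-n (the ratio of unique n-grams to total n-grams) is another widely used lexical-diversity metric; a minimal sketch, not from the original text:

```python
def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams / total n-grams across all texts.
    Higher means more diverse; 1.0 means no n-gram repeats."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Repeated bigrams lower the score; fully distinct texts score 1.0
print(distinct_n(["the weather is nice", "the weather is fine"]))
print(distinct_n(["the weather is nice today", "quantum physics explained simply"]))  # 1.0
```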

Dimension 5: Relevance

Definition: how closely the data matches the training objective.

Problem examples

# Training objective: a medical QA model

# Irrelevant data
irrelevant_data = [
    {"input": "How do I braise pork?", "output": "..."},  # ❌ cooking
    {"input": "How do I buy stocks?", "output": "..."},  # ❌ finance
    {"input": "Python syntax", "output": "..."}  # ❌ programming
]

# Relevant data
relevant_data = [
    {"input": "What should I do about a cold?", "output": "..."},  # ✅ medical
    {"input": "Symptoms of hypertension", "output": "..."},  # ✅ medical
    {"input": "How to prevent diabetes", "output": "..."}  # ✅ medical
]

Detection method

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

class RelevanceClassifier:
    def __init__(self):
        self.vectorizer = CountVectorizer()
        self.classifier = MultinomialNB()

    def train(self, texts, labels):
        """
        Train the relevance classifier.
        texts: list of texts
        labels: labels (1 = relevant, 0 = irrelevant)
        """
        X = self.vectorizer.fit_transform(texts)
        self.classifier.fit(X, labels)

    def predict(self, texts):
        """Predict relevance scores."""
        X = self.vectorizer.transform(texts)
        probs = self.classifier.predict_proba(X)
        return probs[:, 1]  # probability of being relevant

# Training example (medical domain)
train_texts = [
    "cold symptoms",           # relevant
    "hypertension treatment",  # relevant
    "braised pork recipe",     # irrelevant
    "stock market investing",  # irrelevant
]
train_labels = [1, 1, 0, 0]

classifier = RelevanceClassifier()
classifier.train(train_texts, train_labels)

# Test
test_texts = [
    "diabetes prevention",
    "how to cook pasta",
    "early signs of heart disease"
]

scores = classifier.predict(test_texts)
for text, score in zip(test_texts, scores):
    relevance = "relevant" if score > 0.5 else "irrelevant"
    print(f"{text}: {relevance} (confidence: {score:.2f})")

Dimension 6: Safety

Definition: the data contains no harmful, toxic, or biased content.

Problem examples

# Toxic content
toxic_data = [
    "You idiot, get lost",  # ❌ insult
    "Go die",  # ❌ violence
    "People of group X are trash",  # ❌ discrimination
]

# Safe content
safe_data = [
    "Thanks for your help",  # ✅ friendly
    "I don't quite understand this question",  # ✅ neutral
    "Let's solve this problem together",  # ✅ constructive
]

Toxicity detection

class ToxicityDetector:
    def __init__(self):
        # A simplified toxicity lexicon (a real one should be far more comprehensive)
        self.toxic_keywords = {
            'profanity': ['idiot', 'trash', 'moron'],
            'violence': ['kill', 'go die', 'beat up'],
            'discrimination': ['inferior', 'subhuman']
        }

    def detect(self, text):
        """Detect toxicity in a text."""
        results = {
            'is_toxic': False,
            'toxic_types': [],
            'toxic_words': []
        }

        for toxic_type, keywords in self.toxic_keywords.items():
            for keyword in keywords:
                if keyword in text.lower():
                    results['is_toxic'] = True
                    results['toxic_types'].append(toxic_type)
                    results['toxic_words'].append(keyword)

        return results

    def compute_toxicity_score(self, text):
        """Compute a toxicity score in [0, 1]."""
        detection = self.detect(text)
        if not detection['is_toxic']:
            return 0.0

        # Score based on the number of toxic types and words matched
        base_score = 0.3
        type_penalty = len(detection['toxic_types']) * 0.2
        word_penalty = len(detection['toxic_words']) * 0.1

        return min(1.0, base_score + type_penalty + word_penalty)

# Usage
detector = ToxicityDetector()

test_texts = [
    "The weather is nice today",
    "You idiot",
    "Go die, trash"
]

for text in test_texts:
    result = detector.detect(text)
    score = detector.compute_toxicity_score(text)
    print(f"Text: {text}")
    print(f"  Toxic: {result['is_toxic']}")
    print(f"  Types: {result['toxic_types']}")
    print(f"  Toxicity score: {score:.2f}\n")

Recommended real-world tools

# Using the Perspective API (Google)
# pip install google-api-python-client

from googleapiclient import discovery

def check_toxicity_perspective(text, api_key):
    """Detect toxicity with the Perspective API."""
    client = discovery.build(
        "commentanalyzer",
        "v1alpha1",
        developerKey=api_key,
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1"
    )

    analyze_request = {
        'comment': {'text': text},
        'requestedAttributes': {
            'TOXICITY': {},
            'SEVERE_TOXICITY': {},
            'IDENTITY_ATTACK': {},
            'INSULT': {},
            'PROFANITY': {},
            'THREAT': {}
        }
    }

    response = client.comments().analyze(body=analyze_request).execute()

    scores = {}
    for attr, data in response['attributeScores'].items():
        scores[attr] = data['summaryScore']['value']

    return scores

Part 2: Data Cleaning

Cleaning Strategy 1: Format Standardization

Problem: data arrives in wildly different formats and needs to be unified.

import re

class DataFormatter:
    """Format standardization."""

    def __init__(self):
        self.target_format = {
            'instruction': str,
            'input': str,
            'output': str
        }

    def normalize_whitespace(self, text):
        """Normalize whitespace."""
        # Collapse runs of whitespace
        text = re.sub(r'\s+', ' ', text)
        # Strip leading/trailing whitespace
        text = text.strip()
        return text

    def normalize_punctuation(self, text):
        """Normalize punctuation."""
        # CJK punctuation → ASCII punctuation
        replacements = {
            ',': ',',
            '。': '.',
            '!': '!',
            '?': '?',
            ':': ':',
            ';': ';',
            '“': '"',
            '”': '"',
            '‘': "'",
            '’': "'"
        }

        for old, new in replacements.items():
            text = text.replace(old, new)

        return text

    def normalize_sample(self, sample):
        """Normalize a single sample."""
        normalized = {}

        # Standardize field names
        field_mapping = {
            'question': 'instruction',
            'query': 'instruction',
            'prompt': 'instruction',
            'answer': 'output',
            'response': 'output',
            'completion': 'output',
            'context': 'input',
            'text': 'input'
        }

        for key, value in sample.items():
            # Map the field name
            normalized_key = field_mapping.get(key, key)

            # Normalize the text
            if isinstance(value, str):
                value = self.normalize_whitespace(value)
                value = self.normalize_punctuation(value)

            normalized[normalized_key] = value

        # Make sure required fields exist
        for field in self.target_format:
            if field not in normalized:
                normalized[field] = ""

        return normalized

# Usage example (Chinese samples kept to demonstrate CJK punctuation handling)
formatter = DataFormatter()

messy_data = [
    {
        'question': '  什么是AI?  ',  # extra whitespace
        'answer': 'AI是人工智能。。。'  # CJK punctuation
    },
    {
        'query': 'Python是什么',
        'response': 'Python是编程语言'
    }
]

cleaned_data = [formatter.normalize_sample(sample) for sample in messy_data]

for sample in cleaned_data:
    print(sample)

Cleaning Strategy 2: Noise Filtering

Common noise types

import re

class NoiseFilter:
    """Noise filter."""

    def filter_html(self, text):
        """Strip HTML tags."""
        # <p>Hello</p> → Hello
        text = re.sub(r'<[^>]+>', '', text)
        # &nbsp; → space
        text = re.sub(r'&\w+;', ' ', text)
        return text

    def filter_urls(self, text):
        """Replace URLs with a placeholder."""
        # http://example.com → [URL]
        text = re.sub(
            r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
            '[URL]',
            text
        )
        return text

    def filter_special_chars(self, text):
        """Strip special characters."""
        # Keep: letters, digits, common punctuation, brackets (for the [URL]
        # placeholder), and CJK characters
        text = re.sub(r'[^\w\s,.!?;:()(),。!?;:\[\]\u4e00-\u9fff]', '', text)
        return text

    def filter_repeated_chars(self, text):
        """Collapse character repetition."""
        # "哈哈哈哈哈哈" → "哈哈哈" (cap any run of 4+ identical characters at 3)
        text = re.sub(r'(.)\1{3,}', r'\1\1\1', text)
        return text

    def filter_noise(self, text):
        """Apply all filters."""
        text = self.filter_html(text)
        text = self.filter_urls(text)
        text = self.filter_special_chars(text)
        text = self.filter_repeated_chars(text)
        return text

# Test (avoid shadowing the builtin `filter`)
noise_filter = NoiseFilter()

noisy_text = """
<p>访问 https://example.com 了解更多!!!!</p>
哈哈哈哈哈哈哈哈哈 太好笑了@#$%
"""

cleaned = noise_filter.filter_noise(noisy_text)
print(f"Original: {noisy_text}")
print(f"Cleaned: {cleaned}")

Cleaning Strategy 3: Deduplication

Method 1: Exact deduplication

def exact_dedup(dataset):
    """Exact deduplication."""
    seen = set()
    deduped = []

    for sample in dataset:
        # Turn the sample into a hashable string key
        key = f"{sample['instruction']}||{sample['input']}||{sample['output']}"

        if key not in seen:
            seen.add(key)
            deduped.append(sample)

    removed = len(dataset) - len(deduped)
    print(f"Removed {removed} duplicate samples ({removed/len(dataset)*100:.1f}%)")

    return deduped
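For very large corpora, keeping every full key string in memory gets expensive. A common variant (a sketch, not from the original text) stores only a fixed-size digest of each lightly normalized sample, so memory stays bounded regardless of sample length:

```python
import hashlib

def hash_dedup(dataset):
    """Exact dedup keeping 16-byte MD5 digests of normalized keys."""
    seen = set()
    deduped = []
    for sample in dataset:
        # Normalize lightly (lowercase, collapse whitespace) before hashing,
        # so trivial variants collapse to the same key
        key = "||".join(
            " ".join(str(sample.get(f, "")).lower().split())
            for f in ("instruction", "input", "output")
        )
        digest = hashlib.md5(key.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            deduped.append(sample)
    return deduped

data = [
    {"instruction": "Translate", "input": "Hello", "output": "Bonjour"},
    {"instruction": "translate", "input": "hello", "output": "bonjour"},  # case variant
    {"instruction": "Summarize", "input": "A long article", "output": "A summary"},
]
print(len(hash_dedup(data)))  # 2 (the case-variant duplicate is dropped)
```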

Method 2: Fuzzy deduplication (MinHash)

from datasketch import MinHash, MinHashLSH

class FuzzyDeduplicator:
    """MinHash-based fuzzy deduplication."""

    def __init__(self, threshold=0.8, num_perm=128):
        """
        Args:
            threshold: Jaccard similarity threshold (0-1); above it, texts count as duplicates
            num_perm: number of MinHash permutations (more = more accurate but slower)
        """
        self.threshold = threshold
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)

    def create_minhash(self, text):
        """Build a MinHash signature for a text."""
        # Naive whitespace tokenization; for languages written without spaces
        # (e.g. Chinese), use character n-grams instead
        tokens = text.lower().split()

        minhash = MinHash(num_perm=self.num_perm)
        for token in tokens:
            minhash.update(token.encode('utf8'))

        return minhash

    def deduplicate(self, texts):
        """Deduplicate a list of texts."""
        unique_texts = []
        unique_indices = []

        for idx, text in enumerate(texts):
            minhash = self.create_minhash(text)

            # Query for similar texts already indexed
            similar = self.lsh.query(minhash)

            if not similar:
                # Nothing similar yet; keep it
                self.lsh.insert(f"doc_{idx}", minhash)
                unique_texts.append(text)
                unique_indices.append(idx)

        return unique_texts, unique_indices

# Usage example
texts = [
    "the weather is nice today",
    "the weather is really nice today",  # similar to #1
    "principles of quantum computing",
    "the weather is very nice today",  # similar to #1
    "applications of machine learning"
]

dedup = FuzzyDeduplicator(threshold=0.7)
unique_texts, indices = dedup.deduplicate(texts)

print(f"Original: {len(texts)} texts")
print(f"After dedup: {len(unique_texts)} texts")
print(f"Removed: {len(texts) - len(unique_texts)} texts")
print("\nKept texts:")
for idx, text in zip(indices, unique_texts):
    print(f"  [{idx}] {text}")
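MinHash only approximates Jaccard similarity over token sets, so it helps to sanity-check a chosen threshold against the exact value on a few pairs; a small stdlib-only helper (an illustration, not from the original text):

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity over whitespace tokens, the quantity MinHash estimates."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# 5 shared tokens out of 6 total -> about 0.83, above a 0.7 threshold
print(jaccard("the weather is nice today", "the weather is really nice today"))
# No shared tokens -> 0.0
print(jaccard("the weather is nice today", "principles of quantum computing"))
```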

Cleaning Strategy 4: Anomaly Detection

import re
import numpy as np

class AnomalyDetector:
    """Anomalous-sample detection."""

    def check_length_anomaly(self, texts, z_threshold=3.0):
        """Detect length outliers."""
        lengths = [len(text) for text in texts]
        mean_len = np.mean(lengths)
        std_len = np.std(lengths)
        if std_len == 0:  # all texts the same length; no outliers
            return []

        anomalies = []
        for idx, length in enumerate(lengths):
            z_score = abs((length - mean_len) / std_len)
            if z_score > z_threshold:
                anomalies.append({
                    'index': idx,
                    'length': length,
                    'z_score': z_score,
                    'reason': 'length outlier'
                })

        return anomalies

    def check_repetition_anomaly(self, text):
        """Detect repeated content."""
        # Sentence-level repetition; split on both ASCII and CJK periods
        sentences = [s.strip() for s in re.split(r'[.。]', text) if s.strip()]
        if len(sentences) > 1:
            unique_sentences = len(set(sentences))
            repetition_ratio = 1 - (unique_sentences / len(sentences))

            if repetition_ratio > 0.5:
                return True, repetition_ratio

        return False, 0.0

    def check_quality_anomaly(self, text):
        """Detect quality problems."""
        issues = []
        if not text:
            return True, ["empty text"]

        # 1. Too short
        if len(text) < 10:
            issues.append("text too short")

        # 2. Mostly punctuation
        if len(re.findall(r'[^\w\s]', text)) / len(text) > 0.5:
            issues.append("too much punctuation")

        # 3. Mostly uppercase
        if sum(1 for c in text if c.isupper()) / len(text) > 0.5:
            issues.append("too many uppercase letters")

        # 4. Mostly digits
        if sum(1 for c in text if c.isdigit()) / len(text) > 0.7:
            issues.append("too many digits")

        return len(issues) > 0, issues

    def detect_all(self, texts):
        """Run all checks."""
        all_anomalies = []

        # 1. Length outliers
        all_anomalies.extend(self.check_length_anomaly(texts))

        # 2. Per-text checks
        for idx, text in enumerate(texts):
            # Repetition
            is_rep, rep_ratio = self.check_repetition_anomaly(text)
            if is_rep:
                all_anomalies.append({
                    'index': idx,
                    'repetition_ratio': rep_ratio,
                    'reason': 'repeated content'
                })

            # Quality
            is_bad, issues = self.check_quality_anomaly(text)
            if is_bad:
                all_anomalies.append({
                    'index': idx,
                    'issues': issues,
                    'reason': 'quality problem'
                })

        return all_anomalies

# Usage
detector = AnomalyDetector()

test_texts = [
    "A normal piece of text with reasonable length and varied content.",  # normal
    "Hi",  # anomaly: too short
    "!!!!!!!!!!!!!!!!!!!!!",  # anomaly: all punctuation
    "This is a sentence. This is a sentence. This is a sentence. This is a sentence.",  # anomaly: repetition
    "1234567890123456789012345",  # anomaly: mostly digits
]

anomalies = detector.detect_all(test_texts)
print(f"Detected {len(anomalies)} anomalies:")
for anomaly in anomalies:
    print(f"  Index {anomaly['index']}: {anomaly['reason']}")

Part 3: Data Filtering

Filtering Strategy 1: Rule-Based

import re

class RuleBasedFilter:
    """Rule-based filter."""

    def __init__(self):
        self.rules = []

    def add_rule(self, name, check_func, action='remove'):
        """Register a filtering rule."""
        self.rules.append({
            'name': name,
            'check': check_func,
            'action': action
        })

    def filter_dataset(self, dataset):
        """Apply every rule to the dataset."""
        filtered = []
        stats = {rule['name']: 0 for rule in self.rules}

        for sample in dataset:
            keep = True

            for rule in self.rules:
                if not rule['check'](sample):
                    stats[rule['name']] += 1
                    if rule['action'] == 'remove':
                        keep = False
                        break

            if keep:
                filtered.append(sample)

        print("Filtering statistics:")
        for rule_name, count in stats.items():
            print(f"  {rule_name}: {count} samples")

        print(f"\nKept: {len(filtered)}/{len(dataset)} ({len(filtered)/len(dataset)*100:.1f}%)")

        return filtered

# Define rules (avoid shadowing the builtin `filter`)
rule_filter = RuleBasedFilter()

# Rule 1: minimum length
rule_filter.add_rule(
    name="minimum length",
    check_func=lambda s: len(s.get('output', '')) >= 10
)

# Rule 2: no toxic words
toxic_words = ['idiot', 'go die', 'trash']
rule_filter.add_rule(
    name="toxicity check",
    check_func=lambda s: not any(word in s.get('output', '') for word in toxic_words)
)

# Rule 3: input and output must differ
rule_filter.add_rule(
    name="input differs from output",
    check_func=lambda s: s.get('input', '') != s.get('output', '')
)

# Rule 4: must contain letters or CJK characters
rule_filter.add_rule(
    name="contains valid characters",
    check_func=lambda s: re.search(r'[a-zA-Z\u4e00-\u9fff]', s.get('output', '')) is not None
)

# Test
test_dataset = [
    {'input': 'Q1', 'output': 'This is a normal answer'},
    {'input': 'Q2', 'output': 'short'},  # too short
    {'input': 'Q3', 'output': 'You are an idiot, honestly'},  # toxic
    {'input': 'Q4', 'output': 'Q4'},  # input == output
    {'input': 'Q5', 'output': '12345!!!'},  # no valid characters
    {'input': 'Q6', 'output': 'Another normal answer'},
]

filtered_dataset = rule_filter.filter_dataset(test_dataset)

Filtering Strategy 2: Perplexity-Based

Principle: use a language model to compute each text's perplexity and drop texts whose perplexity is too high (a signal of low quality).

import torch
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PerplexityFilter:
    """Perplexity-based filter."""

    def __init__(self, model_name='gpt2', threshold=100.0):
        self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
        self.model = GPT2LMHeadModel.from_pretrained(model_name)
        self.model.eval()
        self.threshold = threshold

    def compute_perplexity(self, text):
        """Compute the perplexity of a text."""
        # Tokenize
        inputs = self.tokenizer(text, return_tensors='pt')
        input_ids = inputs['input_ids']

        with torch.no_grad():
            outputs = self.model(input_ids, labels=input_ids)
            loss = outputs.loss

        # Perplexity = exp(cross-entropy loss)
        perplexity = torch.exp(loss).item()

        return perplexity

    def filter_by_perplexity(self, texts):
        """Filter texts by perplexity."""
        filtered = []
        perplexities = []

        for text in texts:
            ppl = self.compute_perplexity(text)
            perplexities.append(ppl)

            if ppl <= self.threshold:
                filtered.append(text)

        print("Perplexity statistics:")
        print(f"  Mean: {np.mean(perplexities):.2f}")
        print(f"  Median: {np.median(perplexities):.2f}")
        print(f"  Max: {np.max(perplexities):.2f}")
        print(f"  Min: {np.min(perplexities):.2f}")
        print(f"\nFilter result: kept {len(filtered)}/{len(texts)}")

        return filtered, perplexities

# Usage (requires the transformers library; a GPU helps)
# ppl_filter = PerplexityFilter(threshold=100.0)
#
# test_texts = [
#     "This is a normal English sentence.",  # low perplexity
#     "asdkfjal;skdjf;laksjdf;lkj",  # high perplexity (gibberish)
#     "The quick brown fox jumps over the lazy dog.",  # low perplexity
# ]
#
# filtered, ppls = ppl_filter.filter_by_perplexity(test_texts)

Choosing a perplexity threshold

Data type         | Suggested threshold | Notes
High-quality text | <30                 | books, papers
Typical web pages | 30-100              | news, blogs
Social media      | 100-300             | Twitter, Reddit
Noise to discard  | >300                | gibberish, spam
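The thresholds above can be turned into a simple routing function; the tier names and cutoffs below just mirror the table, and should be recalibrated for whichever reference model you compute perplexity with:

```python
def perplexity_tier(ppl: float) -> str:
    """Map a perplexity value to a quality tier using the table's cutoffs."""
    if ppl < 30:
        return "high-quality"
    elif ppl <= 100:
        return "general-web"
    elif ppl <= 300:
        return "social-media"
    return "noise"

for ppl in [12.5, 85.0, 220.0, 1500.0]:
    print(ppl, "->", perplexity_tier(ppl))
```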

Filtering Strategy 3: Classifier-Based

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

class QualityClassifier:
    """Quality classifier."""

    def __init__(self):
        self.vectorizer = TfidfVectorizer(max_features=1000)
        self.classifier = RandomForestClassifier(n_estimators=100)

    def train(self, texts, labels):
        """
        Train the quality classifier.
        texts: list of texts
        labels: quality labels (1 = high quality, 0 = low quality)
        """
        X = self.vectorizer.fit_transform(texts)
        self.classifier.fit(X, labels)

    def predict(self, texts):
        """Predict quality scores."""
        X = self.vectorizer.transform(texts)
        probs = self.classifier.predict_proba(X)
        return probs[:, 1]  # probability of being high quality

    def filter_by_quality(self, texts, threshold=0.5):
        """Filter texts by predicted quality."""
        quality_scores = self.predict(texts)

        filtered = []
        for text, score in zip(texts, quality_scores):
            if score >= threshold:
                filtered.append(text)

        print("Quality filtering:")
        print(f"  Mean quality score: {np.mean(quality_scores):.2f}")
        print(f"  Kept: {len(filtered)}/{len(texts)} ({len(filtered)/len(texts)*100:.1f}%)")

        return filtered, quality_scores

# Training example
train_texts = [
    "A detailed, accurate technical explanation...",  # high quality
    "Thorough content with clear logic and professional wording...",  # high quality
    "No idea, whatever",  # low quality
    "lol lol lol",  # low quality
    # ... more training data
]
train_labels = [1, 1, 0, 0]

clf = QualityClassifier()
clf.train(train_texts, train_labels)

# Filter new data
new_texts = [
    "Detailed technical documentation with code examples and explanations",
    "ok",
    "A comprehensive introduction to machine learning concepts and applications"
]

filtered, scores = clf.filter_by_quality(new_texts, threshold=0.7)

Part 4: Data Augmentation

Augmentation Strategy 1: Back Translation

Principle: text → translate into another language → translate back into the original language.

from transformers import MarianMTModel, MarianTokenizer

class BackTranslator:
    """Back-translation augmentation."""

    def __init__(self, src_lang='zh', tgt_lang='en'):
        # Chinese → English
        self.forward_model_name = f'Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}'
        self.forward_tokenizer = MarianTokenizer.from_pretrained(self.forward_model_name)
        self.forward_model = MarianMTModel.from_pretrained(self.forward_model_name)

        # English → Chinese
        self.backward_model_name = f'Helsinki-NLP/opus-mt-{tgt_lang}-{src_lang}'
        self.backward_tokenizer = MarianTokenizer.from_pretrained(self.backward_model_name)
        self.backward_model = MarianMTModel.from_pretrained(self.backward_model_name)

    def translate(self, text, model, tokenizer):
        """Translate a text."""
        inputs = tokenizer(text, return_tensors="pt", padding=True)
        translated = model.generate(**inputs)
        result = tokenizer.decode(translated[0], skip_special_tokens=True)
        return result

    def back_translate(self, text):
        """Round-trip translate."""
        # Step 1: source language → pivot language
        en_text = self.translate(text, self.forward_model, self.forward_tokenizer)

        # Step 2: pivot language → source language
        zh_text = self.translate(en_text, self.backward_model, self.backward_tokenizer)

        return zh_text, en_text

    def augment_dataset(self, texts, num_augments=1):
        """Augment a dataset."""
        augmented = []

        for text in texts:
            # Keep the original
            augmented.append(text)

            # Generate augmented samples
            for _ in range(num_augments):
                aug_text, _ = self.back_translate(text)
                augmented.append(aug_text)

        return augmented

# Usage (the models take a while to download)
# bt = BackTranslator()
#
# original = "今天天气很好"
# augmented, intermediate = bt.back_translate(original)
#
# print(f"Original: {original}")
# print(f"Pivot (English): {intermediate}")
# print(f"Back-translated: {augmented}")

# Simplified version (no models required)
import random

def simple_back_translate(text):
    """A crude stand-in for back translation: synonym substitution."""
    replacements = {
        'very': ['really', 'extremely', 'quite'],
        'good': ['nice', 'great', 'excellent'],
        'weather': ['climate'],
    }

    augmented = text
    for old, new_list in replacements.items():
        if old in augmented:
            augmented = augmented.replace(old, random.choice(new_list))

    return augmented

# Test the simplified version
original_texts = [
    "the weather is very good today",
    "this method is very effective"
]

for text in original_texts:
    aug = simple_back_translate(text)
    print(f"Original: {text}")
    print(f"Augmented: {aug}\n")

Augmentation Strategy 2: Synonym Replacement

import random

class SynonymAugmentor:
    """Synonym-replacement augmentation."""

    def __init__(self):
        # A simplified synonym dictionary
        self.synonyms = {
            'good': ['nice', 'great', 'excellent', 'outstanding'],
            'bad': ['terrible', 'poor', 'awful'],
            'big': ['huge', 'enormous', 'vast'],
            'small': ['tiny', 'little', 'minute'],
            'fast': ['quick', 'rapid', 'swift'],
            'slow': ['sluggish', 'unhurried'],
            'method': ['approach', 'technique', 'way'],
            'problem': ['question', 'issue', 'challenge'],
        }

    def replace_with_synonym(self, text, replace_ratio=0.3):
        """Replace some words in the text with synonyms."""
        # Work on word tokens so dictionary keys can match whole words
        # (iterating character by character would never match multi-character keys)
        words = text.split()

        # Find replaceable positions
        replaceable = [i for i, word in enumerate(words) if word in self.synonyms]

        # Randomly replace a fraction of them
        if replaceable:
            num_replace = max(1, int(len(replaceable) * replace_ratio))
            indices = random.sample(replaceable, min(num_replace, len(replaceable)))

            for idx in indices:
                words[idx] = random.choice(self.synonyms[words[idx]])

        return ' '.join(words)

    def augment(self, texts, num_augments=2):
        """Augment a dataset."""
        augmented = []

        for text in texts:
            augmented.append(text)  # keep the original

            for _ in range(num_augments):
                augmented.append(self.replace_with_synonym(text))

        return augmented

# Usage
augmentor = SynonymAugmentor()

original_texts = [
    "this is a good method",
    "solving this problem is fast"
]

augmented = augmentor.augment(original_texts, num_augments=2)

print("Augmented dataset:")
for i, text in enumerate(augmented):
    print(f"{i+1}. {text}")

Augmentation Strategy 3: Synthetic Data

class SyntheticDataGenerator:
    """Synthetic data generator."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def generate_from_prompt(self, prompt, num_samples=5, max_length=100):
        """Generate data from a prompt."""
        generated = []

        for _ in range(num_samples):
            inputs = self.tokenizer(prompt, return_tensors='pt')

            outputs = self.model.generate(
                **inputs,
                max_length=max_length,
                num_return_sequences=1,
                temperature=0.8,
                top_p=0.9,
                do_sample=True
            )

            text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
            generated.append(text)

        return generated

    def generate_qa_pairs(self, context, num_pairs=5):
        """Generate QA pairs from a context passage."""
        prompt = f"""Based on the following text, generate {num_pairs} question-answer pairs:

Text: {context}

Question-answer pairs:
"""

        generated = self.generate_from_prompt(prompt, num_samples=1)

        # Parse the generated QA pairs (simplified)
        qa_pairs = []
        # ... parsing logic

        return qa_pairs

# Template-based synthetic data
import random

class TemplateGenerator:
    """Template-based data generation."""

    def __init__(self):
        self.templates = {
            'math': [
                "{a} + {b} = {result}",
                "{a} - {b} = {result}",
                "{a} × {b} = {result}",
            ],
            'comparison': [
                "{item1} is more {attribute} than {item2}",
                "Compared with {item2}, {item1} is more {attribute}",
            ]
        }

    def generate_math_data(self, num_samples=100):
        """Generate arithmetic problems."""
        data = []
        for _ in range(num_samples):
            a = random.randint(1, 100)
            b = random.randint(1, 100)
            op = random.choice(['+', '-', '×'])

            if op == '+':
                result = a + b
                question = f"{a} + {b} = ?"
            elif op == '-':
                result = a - b
                question = f"{a} - {b} = ?"
            else:
                result = a * b
                question = f"{a} × {b} = ?"

            data.append({
                'input': question,
                'output': str(result)
            })

        return data

# Usage
gen = TemplateGenerator()
math_data = gen.generate_math_data(num_samples=10)

print("Synthetic math data:")
for i, item in enumerate(math_data[:5]):
    print(f"{i+1}. {item['input']} → {item['output']}")

Part 5: Engineering Practice

A Data-Processing Pipeline

class DataProcessingPipeline:
    """An end-to-end data-processing pipeline."""

    def __init__(self):
        self.formatter = DataFormatter()
        self.noise_filter = NoiseFilter()
        self.quality_classifier = QualityClassifier()  # must be trained before enabling quality filtering
        self.deduplicator = FuzzyDeduplicator()
        self.anomaly_detector = AnomalyDetector()

        self.stats = {
            'original_count': 0,
            'after_format': 0,
            'after_noise': 0,
            'after_dedup': 0,
            'after_quality': 0,
            'after_anomaly': 0,
            'final_count': 0
        }

    def process(self, dataset, config=None):
        """
        Process a dataset.

        Args:
            dataset: the raw dataset
            config: configuration options
        """
        self.stats['original_count'] = len(dataset)
        print(f"Raw data: {len(dataset)} samples\n")

        # Step 1: format standardization
        print("Step 1: format standardization...")
        dataset = [self.formatter.normalize_sample(s) for s in dataset]
        self.stats['after_format'] = len(dataset)
        print(f"  Done; {len(dataset)} samples kept\n")

        # Step 2: noise filtering
        print("Step 2: noise filtering...")
        for sample in dataset:
            if 'output' in sample:
                sample['output'] = self.noise_filter.filter_noise(sample['output'])
        self.stats['after_noise'] = len(dataset)
        print("  Done\n")

        # Step 3: deduplication
        print("Step 3: deduplication...")
        texts = [s.get('output', '') for s in dataset]
        unique_texts, unique_indices = self.deduplicator.deduplicate(texts)
        dataset = [dataset[i] for i in unique_indices]
        self.stats['after_dedup'] = len(dataset)
        print(f"  Done; {len(dataset)} samples kept\n")

        # Step 4: quality filtering (requires a trained quality classifier)
        if config and config.get('quality_filter', False):
            print("Step 4: quality filtering...")
            texts = [s.get('output', '') for s in dataset]
            quality_scores = self.quality_classifier.predict(texts)
            threshold = config.get('quality_threshold', 0.5)

            dataset = [
                s for s, score in zip(dataset, quality_scores)
                if score >= threshold
            ]
            self.stats['after_quality'] = len(dataset)
            print(f"  Done; {len(dataset)} samples kept\n")

        # Step 5: anomaly detection
        print("Step 5: anomaly detection...")
        texts = [s.get('output', '') for s in dataset]
        anomalies = self.anomaly_detector.detect_all(texts)
        anomaly_indices = set(a['index'] for a in anomalies)

        dataset = [
            s for i, s in enumerate(dataset)
            if i not in anomaly_indices
        ]
        self.stats['after_anomaly'] = len(dataset)
        print(f"  Done; removed {len(anomaly_indices)} anomalies, {len(dataset)} samples kept\n")

        self.stats['final_count'] = len(dataset)

        # Print statistics
        self.print_stats()

        return dataset

    def print_stats(self):
        """Print processing statistics."""
        print("="*50)
        print("Data-processing statistics")
        print("="*50)

        original = self.stats['original_count']

        steps = [
            ('原始数据', 'original_count'),
            ('格式标准化后', 'after_format'),
            ('噪声过滤后', 'after_noise'),
            ('去重后', 'after_dedup'),
            ('质量过滤后', 'after_quality'),
            ('异常检测后', 'after_anomaly'),
        ]

        for name, key in steps:
            count = self.stats.get(key, 0)
            if count > 0:
                ratio = count / original * 100
                print(f"{name:12s}: {count:6d} ({ratio:5.1f}%)")

        final = self.stats['final_count']
        removed = original - final
        print(f"\n总计移除: {removed} ({removed/original*100:.1f}%)")
        print(f"最终数据: {final} ({final/original*100:.1f}%)")
        print("="*50)

# 使用
pipeline = DataProcessingPipeline()

raw_dataset = [
    # ... 原始数据
]

config = {
    'quality_filter': True,
    'quality_threshold': 0.7
}

processed_dataset = pipeline.process(raw_dataset, config)

数据质量监控

import numpy as np

class DataQualityMonitor:
    """数据质量监控"""

    def __init__(self):
        self.metrics = {}

    def compute_metrics(self, dataset):
        """计算质量指标"""
        texts = [s.get('output', '') for s in dataset]

        # 1. 基本统计
        lengths = [len(t) for t in texts]
        self.metrics['count'] = len(dataset)
        self.metrics['avg_length'] = np.mean(lengths)
        self.metrics['std_length'] = np.std(lengths)
        self.metrics['min_length'] = np.min(lengths)
        self.metrics['max_length'] = np.max(lengths)

        # 2. 多样性
        diversity = measure_diversity(texts)
        self.metrics.update(diversity)

        # 3. 毒性率
        detector = ToxicityDetector()
        toxic_count = sum(1 for t in texts if detector.detect(t)['is_toxic'])
        self.metrics['toxicity_rate'] = toxic_count / len(texts)

        # 4. 完整性
        complete_count = sum(
            1 for s in dataset
            if all(s.get(f, '') for f in ['instruction', 'input', 'output'])
        )
        self.metrics['completeness_rate'] = complete_count / len(dataset)

        return self.metrics

    def report(self):
        """生成质量报告"""
        print("\n" + "="*60)
        print("数据质量报告")
        print("="*60)

        print(f"\n基本统计:")
        print(f"  样本数量: {self.metrics['count']}")
        print(f"  平均长度: {self.metrics['avg_length']:.1f} 字符")
        print(f"  长度标准差: {self.metrics['std_length']:.1f}")
        print(f"  长度范围: [{self.metrics['min_length']}, {self.metrics['max_length']}]")

        print(f"\n多样性:")
        print(f"  词汇多样性: {self.metrics['lexical_diversity']:.3f}")
        print(f"  语义多样性: {self.metrics['semantic_diversity']:.3f}")
        print(f"  词汇量: {self.metrics['vocabulary_size']}")

        print(f"\n质量指标:")
        print(f"  毒性率: {self.metrics['toxicity_rate']*100:.2f}%")
        print(f"  完整率: {self.metrics['completeness_rate']*100:.2f}%")

        # 评分
        score = self.compute_quality_score()
        print(f"\n综合质量分数: {score:.1f}/100")

        if score >= 80:
            grade = "优秀 ✅"
        elif score >= 60:
            grade = "良好"
        else:
            grade = "需要改进 ⚠️"

        print(f"质量等级: {grade}")
        print("="*60)

    def compute_quality_score(self):
        """计算综合质量分数(0-100)"""
        score = 0

        # 多样性(40分)
        score += self.metrics['lexical_diversity'] * 20
        score += self.metrics['semantic_diversity'] * 20

        # 完整性(30分)
        score += self.metrics['completeness_rate'] * 30

        # 安全性(30分)
        score += (1 - self.metrics['toxicity_rate']) * 30

        return min(100, score)

# 使用
monitor = DataQualityMonitor()
metrics = monitor.compute_metrics(processed_dataset)
monitor.report()

小结

核心要点

1. 数据质量的重要性

  • 质量 > 数量
  • 高质量数据可以减少训练成本50%,提升性能25-130%
  • "Garbage in, garbage out"

2. 六个质量维度

准确性 → 内容正确
完整性 → 信息充分
一致性 → 格式统一
多样性 → 覆盖广泛
相关性 → 符合目标
安全性 → 无害无毒

3. 核心处理技术

清洗

  • 格式标准化
  • 噪声过滤
  • 去重(精确/模糊/语义)
  • 异常检测

过滤

  • 基于规则
  • 基于困惑度
  • 基于分类器
  • 毒性检测

增强

  • 回译
  • 同义词替换
  • 合成数据
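上面列出的"精确去重/模糊去重"可以只用标准库勾勒一个最小实现。注意:模糊去重这里用 `difflib.SequenceMatcher` 做两两比较,复杂度是 O(n²),只适合小规模数据;大规模场景应使用 MinHash/LSH(如前文 FuzzyDeduplicator 所用的思路),此处仅为原理演示:

```python
import hashlib
from difflib import SequenceMatcher

def exact_dedup(texts):
    """精确去重:按内容哈希,保留首次出现的文本"""
    seen, unique = set(), []
    for t in texts:
        h = hashlib.md5(t.encode('utf-8')).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(t)
    return unique

def fuzzy_dedup(texts, threshold=0.9):
    """模糊去重:与任一已保留文本相似度 >= threshold 即视为重复。

    O(n^2) 两两比较,仅用于演示;大规模请改用 MinHash/LSH。
    """
    unique = []
    for t in texts:
        if all(SequenceMatcher(None, t, u).ratio() < threshold
               for u in unique):
            unique.append(t)
    return unique
```

例如,两条只差一个标点的文本会被模糊去重合并,而精确去重会把它们当作两条不同数据——这正是两者的分工:精确去重快而保守,模糊去重慢而彻底。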

4. 工程实践

完整的数据处理Pipeline:

原始数据
  ↓
格式标准化
  ↓
噪声过滤
  ↓
去重
  ↓
质量过滤
  ↓
异常检测
  ↓
质量监控
  ↓
高质量数据

实践建议

1. 分阶段处理

阶段1: 快速清洗(去除明显噪声)
  → 移除HTML、URL、特殊字符
  → 格式标准化
  → 精确去重

阶段2: 深度过滤(提升质量)
  → 模糊去重
  → 困惑度过滤
  → 质量分类

阶段3: 精细优化(针对性处理)
  → 领域相关性过滤
  → 毒性检测
  → 数据增强
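以阶段1为例,"移除HTML、URL、特殊字符 + 格式标准化"可以浓缩成一个很短的清洗函数。下面是一个最小草图(正则规则为常见写法,实际项目应按自己数据的噪声特点调整):

```python
import re
import unicodedata

def quick_clean(text):
    """阶段1的快速清洗草图:去HTML标签、URL,规范化空白与Unicode"""
    text = re.sub(r'<[^>]+>', ' ', text)        # 移除HTML标签
    text = re.sub(r'https?://\S+', ' ', text)   # 移除URL
    text = unicodedata.normalize('NFKC', text)  # Unicode规范化(全角→半角等)
    text = re.sub(r'\s+', ' ', text).strip()    # 合并多余空白
    return text
```

这类规则清洗几乎零成本,却能在进入阶段2的重计算步骤(模糊去重、困惑度过滤)之前,先把数据量和噪声都压下来。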

2. 质量优先级

高优先级(必须做):
✅ 去重(避免记忆)
✅ 毒性检测(安全)
✅ 格式标准化(一致性)

中优先级(推荐做):
⭕ 困惑度过滤(质量)
⭕ 相关性过滤(效率)
⭕ 长度过滤(稳定性)

低优先级(可选):
○ 数据增强(当数据不足时)
○ 复杂的语义分析

3. 监控指标

必监控:
- 样本数量变化
- 平均长度
- 去重率

推荐监控:
- 词汇多样性
- 语义多样性
- 毒性率
- 完整率

高级监控:
- 困惑度分布
- 质量分数分布
- 各维度详细指标
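"必监控"的三个指标(样本数量、平均长度、去重率)不依赖任何模型,可以在每个处理步骤后低成本地计算。下面是一个最小计算草图(函数名 `basic_monitor_metrics` 为本文假设):

```python
def basic_monitor_metrics(texts):
    """计算"必监控"指标:样本数量、平均长度、去重率"""
    n = len(texts)
    avg_len = sum(len(t) for t in texts) / n if n else 0.0
    unique = len(set(texts))                 # 精确意义上的唯一文本数
    dup_rate = (n - unique) / n if n else 0.0
    return {'count': n, 'avg_length': avg_len, 'dup_rate': dup_rate}
```

在pipeline的每一步之后各调用一次,把结果按步骤记录下来,就能快速定位是哪一步导致数据量骤降或分布异常。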

4. 常见陷阱

过度过滤

问题:过滤太严格,数据量大幅减少
后果:模型训练不充分
建议:先宽松过滤,逐步调整阈值

忽略多样性

问题:只关注质量,忽略多样性
后果:模型泛化能力差
建议:平衡质量和多样性

一次性处理

问题:一次性应用所有过滤规则
后果:难以定位问题
建议:分步处理,每步验证

质量检查清单

处理前

  • 了解数据来源和特点
  • 定义质量标准
  • 设计处理pipeline
  • 准备验证数据集

处理中

  • 每步记录数据量变化
  • 采样检查中间结果
  • 监控关键质量指标
  • 保存处理日志

处理后

  • 生成质量报告
  • 人工抽查样本
  • 对比前后效果
  • 评估训练效果

记住:数据质量工程是一个迭代过程,需要不断优化和调整!