开篇:数据质量决定模型上限
在大模型时代,有一句话越来越被验证:"Garbage in, garbage out"(垃圾进,垃圾出)。
真实案例:数据质量的巨大影响
案例1:GPT-3到GPT-3.5的飞跃
GPT-3(2020):
预训练数据:300B tokens(CommonCrawl为主)
效果:理解能力强,但经常产生无意义内容
GPT-3.5(2022):
预训练数据:类似规模,但经过严格过滤
+ 高质量指令数据(InstructGPT)
效果:质的飞跃,可用性大幅提升 ✅
关键差异:不是数据量,而是数据质量!
案例2:LLaMA的成功秘诀
Meta的LLaMA论文揭示:
| 模型 | 数据量 | 数据质量策略 | 性能 |
|---|---|---|---|
| OPT-175B | 180B tokens | 标准清洗 | 基准 |
| LLaMA-65B | 1.4T tokens | 严格过滤+去重 | 超越OPT ✅ |
LLaMA用更小的模型(65B vs 175B),通过更多高质量数据,达到更好效果。
案例3:低质量数据的灾难
场景:用Reddit数据训练对话模型
原始数据(未过滤):
"你好" → "滚"
"怎么办?" → "关我屁事"
"谢谢" → "傻X"
训练后的模型:
用户:"你好"
模型:"滚开,别烦我" ❌
问题:模型学到了有毒、不友好的表达
数据质量的量化影响
实验数据(7B模型,在不同质量数据上训练):
| 数据类型 | 数据量 | MMLU | HumanEval | AlpacaEval | 训练成本 |
|---|---|---|---|---|---|
| 原始爬虫数据 | 1T tokens | 45.2% | 12.3% | 52.1% | 100% |
| 基础清洗 | 800B tokens | 52.8% | 18.5% | 65.3% | 80% |
| 高质量过滤 | 500B tokens | 61.5% | 28.7% | 78.9% | 50% ✅ |
观察:
- 数据量减少50%
- 训练成本减半
- 性能提升25-130%
结论:质量 > 数量!
第一部分:数据质量的六个维度
维度1:准确性(Accuracy)
定义:数据内容的正确性和真实性。
问题示例:
# 错误的事实
"埃菲尔铁塔位于英国伦敦" ❌
"1+1=3" ❌
"Python是1995年发明的" ❌ (实际是1991年)
# 正确的事实
"埃菲尔铁塔位于法国巴黎" ✅
"1+1=2" ✅
"Python是1991年发明的" ✅
检测方法:
import re
def check_factual_accuracy(text):
"""简单的事实检查"""
errors = []
# 规则1:检查基本数学
math_pattern = r'(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)'
for match in re.finditer(math_pattern, text):
a, b, result = map(int, match.groups())
if a + b != result:
errors.append(f"数学错误: {match.group()}")
# 规则2:检查常识性错误
false_facts = [
(r'埃菲尔铁塔.*伦敦', '埃菲尔铁塔在巴黎,不在伦敦'),
(r'中国.*首都.*上海', '中国首都是北京,不是上海'),
]
for pattern, error_msg in false_facts:
if re.search(pattern, text):
errors.append(error_msg)
return errors
# 测试
text = "埃菲尔铁塔位于伦敦,2+2=5"
errors = check_factual_accuracy(text)
print(f"发现 {len(errors)} 个错误:")
for error in errors:
print(f" - {error}")
实际影响:
| 数据准确率 | 模型幻觉率 | 可信度 |
|---|---|---|
| 60% | 45% ❌ | 低 |
| 80% | 25% ⚠️ | 中 |
| 95% | 8% ✅ | 高 |
| 99% | 3% ✅✅ | 很高 |
维度2:完整性(Completeness)
定义:数据是否包含必要的信息。
问题示例:
# 不完整的数据
incomplete_data = {
"instruction": "总结这篇文章",
"input": "", # ❌ 缺少文章内容
"output": "这篇文章讨论了..."
}
# 完整的数据
complete_data = {
"instruction": "总结这篇文章",
"input": "人工智能是...[完整文章]", # ✅ 包含必要信息
"output": "这篇文章讨论了人工智能的发展..."
}
检测方法:
def check_completeness(sample):
"""检查数据完整性"""
issues = []
# 检查必需字段
required_fields = ['instruction', 'input', 'output']
for field in required_fields:
if field not in sample:
issues.append(f"缺少字段: {field}")
elif not sample[field] or sample[field].strip() == '':
issues.append(f"字段为空: {field}")
# 检查长度
if 'output' in sample and len(sample['output']) < 10:
issues.append("输出过短,可能不完整")
# 检查截断
if sample.get('output', '').endswith('...'):
issues.append("输出可能被截断")
return issues
# 测试
samples = [
{"instruction": "翻译", "input": "Hello", "output": "你好"},
{"instruction": "总结", "input": "", "output": "文章讲述了..."},
{"instruction": "写作", "input": "写一篇文章", "output": "标题:..."}
]
for i, sample in enumerate(samples):
issues = check_completeness(sample)
if issues:
print(f"样本 {i} 问题: {issues}")
维度3:一致性(Consistency)
定义:数据格式和语义的统一性。
问题示例:
# 格式不一致
inconsistent_data = [
{"question": "什么是AI?", "answer": "人工智能是..."},
{"query": "AI是什么?", "response": "..."}, # ❌ 字段名不同
{"q": "AI定义", "a": "..."} # ❌ 字段名缩写
]
# 语义不一致
semantic_inconsistency = [
{
"input": "北京的天气",
"output": "上海今天晴天" # ❌ 答非所问
},
{
"input": "1+1等于几?",
"output": "Python是一种编程语言" # ❌ 无关回答
}
]
检测方法:
from collections import Counter
def check_consistency(dataset):
"""检查数据集一致性"""
issues = []
# 1. 检查字段一致性
field_names = [set(sample.keys()) for sample in dataset]
field_counter = Counter([frozenset(fields) for fields in field_names])
if len(field_counter) > 1:
issues.append(f"发现 {len(field_counter)} 种不同的字段组合")
print("字段组合:")
for fields, count in field_counter.most_common():
print(f" {set(fields)}: {count} 个样本")
# 2. 检查格式一致性
date_formats = []
for sample in dataset:
text = str(sample)
if re.search(r'\d{4}-\d{2}-\d{2}', text):
date_formats.append('YYYY-MM-DD')
elif re.search(r'\d{2}/\d{2}/\d{4}', text):
date_formats.append('MM/DD/YYYY')
if len(set(date_formats)) > 1:
issues.append(f"日期格式不一致: {set(date_formats)}")
return issues
# 测试
mixed_dataset = [
{"question": "Q1", "answer": "A1"},
{"query": "Q2", "response": "A2"},
{"q": "Q3", "a": "A3"}
]
issues = check_consistency(mixed_dataset)
print(f"一致性问题: {issues}")
维度4:多样性(Diversity)
定义:数据涵盖的主题、风格、难度的广泛性。
问题示例:
# 缺乏多样性(所有样本都是简单的加法)
low_diversity = [
{"input": "1+1", "output": "2"},
{"input": "2+2", "output": "4"},
{"input": "3+3", "output": "6"},
# ... 1000个类似样本
]
# 问题:模型只会简单加法,不会乘法、除法
# 高多样性
high_diversity = [
{"input": "1+1", "output": "2"}, # 加法
{"input": "3×4", "output": "12"}, # 乘法
{"input": "求解 x²=9", "output": "x=±3"}, # 方程
{"input": "微积分:∫x dx", "output": "x²/2+C"}, # 高级数学
]
评估方法:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def measure_diversity(texts):
"""衡量文本多样性"""
# 1. 词汇多样性
all_words = []
for text in texts:
words = text.lower().split()
all_words.extend(words)
vocabulary_size = len(set(all_words))
total_words = len(all_words)
lexical_diversity = vocabulary_size / total_words
# 2. 语义多样性(基于TF-IDF相似度)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)
# 计算平均相似度
similarities = []
n = len(texts)
for i in range(n):
for j in range(i+1, n):
sim = cosine_similarity(tfidf_matrix[i], tfidf_matrix[j])[0][0]
similarities.append(sim)
avg_similarity = np.mean(similarities)
semantic_diversity = 1 - avg_similarity # 相似度越低,多样性越高
return {
'lexical_diversity': lexical_diversity,
'semantic_diversity': semantic_diversity,
'vocabulary_size': vocabulary_size
}
# 测试
low_div_texts = [
"今天天气很好",
"今天天气真好",
"今天天气特别好"
]
high_div_texts = [
"今天天气很好",
"讨论量子物理的基本原理",
"如何制作美味的意大利面"
]
print("低多样性:", measure_diversity(low_div_texts))
print("高多样性:", measure_diversity(high_div_texts))
多样性对模型的影响:
| 多样性指标 | MMLU | 泛化能力 | 过拟合风险 |
|---|---|---|---|
| 低 (<0.3) | 45% | 差 | 高 ❌ |
| 中 (0.3-0.6) | 58% | 中等 | 中等 |
| 高 (>0.6) | 67% | 好 | 低 ✅ |
维度5:相关性(Relevance)
定义:数据与训练目标的相关程度。
问题示例:
# 训练目标:医疗问答模型
# 不相关数据
irrelevant_data = [
{"input": "如何做红烧肉?", "output": "..."}, # ❌ 烹饪
{"input": "股票怎么买?", "output": "..."}, # ❌ 金融
{"input": "Python语法", "output": "..."} # ❌ 编程
]
# 相关数据
relevant_data = [
{"input": "感冒了怎么办?", "output": "..."}, # ✅ 医疗
{"input": "高血压的症状", "output": "..."}, # ✅ 医疗
{"input": "如何预防糖尿病", "output": "..."} # ✅ 医疗
]
检测方法:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
class RelevanceClassifier:
def __init__(self):
self.vectorizer = CountVectorizer()
self.classifier = MultinomialNB()
def train(self, texts, labels):
"""
训练相关性分类器
texts: 文本列表
labels: 标签(1=相关,0=不相关)
"""
X = self.vectorizer.fit_transform(texts)
self.classifier.fit(X, labels)
def predict(self, texts):
"""预测相关性"""
X = self.vectorizer.transform(texts)
probs = self.classifier.predict_proba(X)
return probs[:, 1] # 返回相关的概率
# 训练示例(医疗领域)
train_texts = [
"感冒症状", # 相关
"高血压治疗", # 相关
"红烧肉做法", # 不相关
"股票投资", # 不相关
]
train_labels = [1, 1, 0, 0]
classifier = RelevanceClassifier()
classifier.train(train_texts, train_labels)
# 测试
test_texts = [
"糖尿病预防",
"意大利面的做法",
"心脏病的早期征兆"
]
scores = classifier.predict(test_texts)
for text, score in zip(test_texts, scores):
relevance = "相关" if score > 0.5 else "不相关"
print(f"{text}: {relevance} (置信度: {score:.2f})")
维度6:安全性(Safety)
定义:数据不包含有害、有毒、偏见的内容。
问题示例:
# 有毒内容
toxic_data = [
"你这个傻X,滚", # ❌ 辱骂
"去死吧", # ❌ 暴力
"XX族都是垃圾", # ❌ 种族歧视
]
# 安全内容
safe_data = [
"谢谢你的帮助", # ✅ 友好
"我不太理解这个问题", # ✅ 中性
"让我们一起解决这个问题", # ✅ 积极
]
毒性检测:
import re
class ToxicityDetector:
def __init__(self):
# 简化的毒性词典(实际应用中应该更全面)
self.toxic_keywords = {
'profanity': ['傻X', '去死', '垃圾', '白痴'],
'violence': ['杀', '打死', '暴力'],
'discrimination': ['歧视', '低等', '劣等']
}
def detect(self, text):
"""检测文本毒性"""
results = {
'is_toxic': False,
'toxic_types': [],
'toxic_words': []
}
for toxic_type, keywords in self.toxic_keywords.items():
for keyword in keywords:
if keyword in text:
results['is_toxic'] = True
results['toxic_types'].append(toxic_type)
results['toxic_words'].append(keyword)
return results
def compute_toxicity_score(self, text):
"""计算毒性分数(0-1)"""
detection = self.detect(text)
if not detection['is_toxic']:
return 0.0
# 根据毒性类型和数量计算分数
base_score = 0.3
type_penalty = len(detection['toxic_types']) * 0.2
word_penalty = len(detection['toxic_words']) * 0.1
return min(1.0, base_score + type_penalty + word_penalty)
# 使用
detector = ToxicityDetector()
test_texts = [
"今天天气很好",
"你这个傻X",
"去死吧,垃圾"
]
for text in test_texts:
result = detector.detect(text)
score = detector.compute_toxicity_score(text)
print(f"文本: {text}")
print(f" 有毒: {result['is_toxic']}")
print(f" 类型: {result['toxic_types']}")
print(f" 毒性分数: {score:.2f}\n")
实际工具推荐:
# 使用Perspective API(Google)
# pip install google-api-python-client
from googleapiclient import discovery
def check_toxicity_perspective(text, api_key):
"""使用Perspective API检测毒性"""
client = discovery.build(
"commentanalyzer",
"v1alpha1",
developerKey=api_key,
discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1"
)
analyze_request = {
'comment': {'text': text},
'requestedAttributes': {
'TOXICITY': {},
'SEVERE_TOXICITY': {},
'IDENTITY_ATTACK': {},
'INSULT': {},
'PROFANITY': {},
'THREAT': {}
}
}
response = client.comments().analyze(body=analyze_request).execute()
scores = {}
for attr, data in response['attributeScores'].items():
scores[attr] = data['summaryScore']['value']
return scores
第二部分:数据清洗
清洗策略1:格式标准化
问题:数据格式五花八门,需要统一。
class DataFormatter:
"""数据格式标准化"""
def __init__(self):
self.target_format = {
'instruction': str,
'input': str,
'output': str
}
def normalize_whitespace(self, text):
"""标准化空白字符"""
# 移除多余空格
text = re.sub(r'\s+', ' ', text)
# 移除首尾空格
text = text.strip()
return text
def normalize_punctuation(self, text):
"""标准化标点符号"""
# 中文标点 → 英文标点
replacements = {
',': ',',
'。': '.',
'!': '!',
'?': '?',
':': ':',
';': ';',
'"': '"',
'"': '"',
''': "'",
''': "'"
}
for old, new in replacements.items():
text = text.replace(old, new)
return text
def normalize_sample(self, sample):
"""标准化单个样本"""
normalized = {}
# 字段名标准化
field_mapping = {
'question': 'instruction',
'query': 'instruction',
'prompt': 'instruction',
'answer': 'output',
'response': 'output',
'completion': 'output',
'context': 'input',
'text': 'input'
}
for key, value in sample.items():
# 映射字段名
normalized_key = field_mapping.get(key, key)
# 标准化文本
if isinstance(value, str):
value = self.normalize_whitespace(value)
value = self.normalize_punctuation(value)
normalized[normalized_key] = value
# 确保必需字段存在
for field in self.target_format:
if field not in normalized:
normalized[field] = ""
return normalized
# 使用示例
formatter = DataFormatter()
messy_data = [
{
'question': ' 什么是AI? ', # 多余空格
'answer': 'AI是人工智能。。。' # 中文标点
},
{
'query': 'Python是什么',
'response': 'Python是编程语言'
}
]
cleaned_data = [formatter.normalize_sample(sample) for sample in messy_data]
for sample in cleaned_data:
print(sample)
清洗策略2:噪声过滤
常见噪声类型:
class NoiseFilter:
"""噪声过滤器"""
def filter_html(self, text):
"""移除HTML标签"""
# <p>Hello</p> → Hello
text = re.sub(r'<[^>]+>', '', text)
# → 空格
text = re.sub(r'&\w+;', ' ', text)
return text
def filter_urls(self, text):
"""移除URL"""
# http://example.com → [URL]
text = re.sub(
r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
'[URL]',
text
)
return text
def filter_special_chars(self, text):
"""移除特殊字符"""
# 保留:字母、数字、常用标点、中文
text = re.sub(r'[^\w\s,.!?;:()(),。!?;:\u4e00-\u9fff]', '', text)
return text
def filter_repeated_chars(self, text):
"""处理重复字符"""
# "哈哈哈哈哈哈" → "哈哈哈"
text = re.sub(r'(.)\1{3,}', r'\1\1\1', text)
return text
def filter_noise(self, text):
"""综合过滤"""
text = self.filter_html(text)
text = self.filter_urls(text)
text = self.filter_special_chars(text)
text = self.filter_repeated_chars(text)
return text
# 测试
filter = NoiseFilter()
noisy_text = """
<p>访问 https://example.com 了解更多!!!!</p>
哈哈哈哈哈哈哈哈哈 太好笑了@#$%
"""
cleaned = filter.filter_noise(noisy_text)
print(f"原始: {noisy_text}")
print(f"清洗: {cleaned}")
清洗策略3:去重
方法1:精确去重
def exact_dedup(dataset):
"""精确去重"""
seen = set()
deduped = []
for sample in dataset:
# 将样本转换为可哈希的字符串
key = f"{sample['instruction']}||{sample['input']}||{sample['output']}"
if key not in seen:
seen.add(key)
deduped.append(sample)
removed = len(dataset) - len(deduped)
print(f"移除 {removed} 个重复样本 ({removed/len(dataset)*100:.1f}%)")
return deduped
方法2:模糊去重(MinHash)
from datasketch import MinHash, MinHashLSH
class FuzzyDeduplicator:
"""基于MinHash的模糊去重"""
def __init__(self, threshold=0.8, num_perm=128):
"""
Args:
threshold: 相似度阈值(0-1),超过则认为重复
num_perm: MinHash的排列数(越大越准确但越慢)
"""
self.threshold = threshold
self.num_perm = num_perm
self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
def create_minhash(self, text):
"""为文本创建MinHash签名"""
# 分词(简单按空格分)
tokens = text.lower().split()
# 创建MinHash
minhash = MinHash(num_perm=self.num_perm)
for token in tokens:
minhash.update(token.encode('utf8'))
return minhash
def deduplicate(self, texts):
"""去重文本列表"""
unique_texts = []
unique_indices = []
for idx, text in enumerate(texts):
minhash = self.create_minhash(text)
# 检查是否有相似的已存在
similar = self.lsh.query(minhash)
if not similar:
# 没有相似的,添加
self.lsh.insert(f"doc_{idx}", minhash)
unique_texts.append(text)
unique_indices.append(idx)
return unique_texts, unique_indices
# 使用示例
texts = [
"今天天气很好",
"今天天气真不错", # 与第1条相似
"量子计算的原理",
"今天的天气非常好", # 与第1条相似
"机器学习的应用"
]
dedup = FuzzyDeduplicator(threshold=0.7)
unique_texts, indices = dedup.deduplicate(texts)
print(f"原始: {len(texts)} 条")
print(f"去重后: {len(unique_texts)} 条")
print(f"移除: {len(texts) - len(unique_texts)} 条")
print(f"\n保留的文本:")
for idx, text in zip(indices, unique_texts):
print(f" [{idx}] {text}")
清洗策略4:异常检测
class AnomalyDetector:
"""异常样本检测"""
def check_length_anomaly(self, texts, z_threshold=3.0):
"""检测长度异常"""
lengths = [len(text) for text in texts]
mean_len = np.mean(lengths)
std_len = np.std(lengths)
anomalies = []
for idx, length in enumerate(lengths):
z_score = abs((length - mean_len) / std_len)
if z_score > z_threshold:
anomalies.append({
'index': idx,
'length': length,
'z_score': z_score,
'reason': '长度异常'
})
return anomalies
def check_repetition_anomaly(self, text):
"""检测重复异常"""
# 检测句子级别重复
sentences = text.split('.')
if len(sentences) > 1:
unique_sentences = len(set(sentences))
repetition_ratio = 1 - (unique_sentences / len(sentences))
if repetition_ratio > 0.5:
return True, repetition_ratio
return False, 0.0
def check_quality_anomaly(self, text):
"""检测质量异常"""
issues = []
# 1. 过短
if len(text) < 10:
issues.append("文本过短")
# 2. 全是标点
if len(re.findall(r'[^\w\s]', text)) / len(text) > 0.5:
issues.append("标点过多")
# 3. 大写字母过多
if sum(1 for c in text if c.isupper()) / len(text) > 0.5:
issues.append("大写字母过多")
# 4. 数字过多
if sum(1 for c in text if c.isdigit()) / len(text) > 0.7:
issues.append("数字过多")
return len(issues) > 0, issues
def detect_all(self, texts):
"""检测所有异常"""
all_anomalies = []
# 1. 长度异常
length_anomalies = self.check_length_anomaly(texts)
all_anomalies.extend(length_anomalies)
# 2. 其他异常
for idx, text in enumerate(texts):
# 重复检测
is_rep, rep_ratio = self.check_repetition_anomaly(text)
if is_rep:
all_anomalies.append({
'index': idx,
'repetition_ratio': rep_ratio,
'reason': '内容重复'
})
# 质量检测
is_bad, issues = self.check_quality_anomaly(text)
if is_bad:
all_anomalies.append({
'index': idx,
'issues': issues,
'reason': '质量问题'
})
return all_anomalies
# 使用
detector = AnomalyDetector()
test_texts = [
"正常的文本内容,长度适中。" * 10, # 正常
"短", # 异常:过短
"!!!!!!!!!!!!!!!!!!!!!", # 异常:全是标点
"这是一句话。这是一句话。这是一句话。这是一句话。", # 异常:重复
"1234567890123456789012345", # 异常:数字过多
]
anomalies = detector.detect_all(test_texts)
print(f"检测到 {len(anomalies)} 个异常:")
for anomaly in anomalies:
print(f" 索引 {anomaly['index']}: {anomaly['reason']}")
第三部分:数据过滤
过滤策略1:基于规则
class RuleBasedFilter:
"""基于规则的过滤器"""
def __init__(self):
self.rules = []
def add_rule(self, name, check_func, action='remove'):
"""添加过滤规则"""
self.rules.append({
'name': name,
'check': check_func,
'action': action
})
def filter_dataset(self, dataset):
"""应用所有规则过滤数据集"""
filtered = []
stats = {rule['name']: 0 for rule in self.rules}
for sample in dataset:
keep = True
for rule in self.rules:
if not rule['check'](sample):
stats[rule['name']] += 1
if rule['action'] == 'remove':
keep = False
break
if keep:
filtered.append(sample)
print("过滤统计:")
for rule_name, count in stats.items():
print(f" {rule_name}: {count} 个样本")
print(f"\n保留: {len(filtered)}/{len(dataset)} ({len(filtered)/len(dataset)*100:.1f}%)")
return filtered
# 定义规则
filter = RuleBasedFilter()
# 规则1:最小长度
filter.add_rule(
name="最小长度检查",
check_func=lambda s: len(s.get('output', '')) >= 10
)
# 规则2:不包含有毒词汇
toxic_words = ['傻X', '去死', '垃圾']
filter.add_rule(
name="毒性检查",
check_func=lambda s: not any(word in s.get('output', '') for word in toxic_words)
)
# 规则3:输入输出不能完全相同
filter.add_rule(
name="输入输出不同",
check_func=lambda s: s.get('input', '') != s.get('output', '')
)
# 规则4:必须包含字母或中文
filter.add_rule(
name="包含有效字符",
check_func=lambda s: re.search(r'[a-zA-Z\u4e00-\u9fff]', s.get('output', '')) is not None
)
# 测试
test_dataset = [
{'input': 'Q1', 'output': '这是一个正常的回答'},
{'input': 'Q2', 'output': '短'}, # 太短
{'input': 'Q3', 'output': '你这个傻X'}, # 有毒
{'input': 'Q4', 'output': 'Q4'}, # 输入输出相同
{'input': 'Q5', 'output': '12345!!!'}, # 无有效字符
{'input': 'Q6', 'output': '另一个正常的回答'},
]
filtered_dataset = filter.filter_dataset(test_dataset)
过滤策略2:基于困惑度(Perplexity)
原理:使用语言模型计算困惑度,过滤掉困惑度过高(低质量)的数据。
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class PerplexityFilter:
"""基于困惑度的过滤器"""
def __init__(self, model_name='gpt2', threshold=100.0):
self.tokenizer = GPT2Tokenizer.from_pretrained(model_name)
self.model = GPT2LMHeadModel.from_pretrained(model_name)
self.model.eval()
self.threshold = threshold
def compute_perplexity(self, text):
"""计算文本的困惑度"""
# Tokenize
inputs = self.tokenizer(text, return_tensors='pt')
input_ids = inputs['input_ids']
with torch.no_grad():
outputs = self.model(input_ids, labels=input_ids)
loss = outputs.loss
# Perplexity = exp(loss)
perplexity = torch.exp(loss).item()
return perplexity
def filter_by_perplexity(self, texts):
"""根据困惑度过滤"""
filtered = []
perplexities = []
for text in texts:
ppl = self.compute_perplexity(text)
perplexities.append(ppl)
if ppl <= self.threshold:
filtered.append(text)
print(f"困惑度统计:")
print(f" 平均: {np.mean(perplexities):.2f}")
print(f" 中位数: {np.median(perplexities):.2f}")
print(f" 最大: {np.max(perplexities):.2f}")
print(f" 最小: {np.min(perplexities):.2f}")
print(f"\n过滤结果: 保留 {len(filtered)}/{len(texts)}")
return filtered, perplexities
# 使用(需要GPU和transformers库)
# filter = PerplexityFilter(threshold=100.0)
#
# test_texts = [
# "This is a normal English sentence.", # 低困惑度
# "asdkfjal;skdjf;laksjdf;lkj", # 高困惑度(乱码)
# "The quick brown fox jumps over the lazy dog.", # 低困惑度
# ]
#
# filtered, ppls = filter.filter_by_perplexity(test_texts)
困惑度阈值选择:
| 数据类型 | 建议阈值 | 说明 |
|---|---|---|
| 高质量文本 | <30 | 书籍、论文 |
| 一般网页 | 30-100 | 新闻、博客 |
| 社交媒体 | 100-300 | Twitter、Reddit |
| 过滤噪声 | >300 | 乱码、spam |
过滤策略3:基于分类器
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
class QualityClassifier:
"""质量分类器"""
def __init__(self):
self.vectorizer = TfidfVectorizer(max_features=1000)
self.classifier = RandomForestClassifier(n_estimators=100)
def train(self, texts, labels):
"""
训练质量分类器
texts: 文本列表
labels: 质量标签(1=高质量,0=低质量)
"""
X = self.vectorizer.fit_transform(texts)
self.classifier.fit(X, labels)
def predict(self, texts):
"""预测文本质量"""
X = self.vectorizer.transform(texts)
probs = self.classifier.predict_proba(X)
return probs[:, 1] # 返回高质量的概率
def filter_by_quality(self, texts, threshold=0.5):
"""根据质量过滤"""
quality_scores = self.predict(texts)
filtered = []
for text, score in zip(texts, quality_scores):
if score >= threshold:
filtered.append(text)
print(f"质量过滤:")
print(f" 平均质量分数: {np.mean(quality_scores):.2f}")
print(f" 保留: {len(filtered)}/{len(texts)} ({len(filtered)/len(texts)*100:.1f}%)")
return filtered, quality_scores
# 训练示例
train_texts = [
"这是一个详细、准确的技术解释...", # 高质量
"内容详实,逻辑清晰,表述专业...", # 高质量
"不知道啊,随便吧", # 低质量
"哈哈哈哈", # 低质量
# ... 更多训练数据
]
train_labels = [1, 1, 0, 0]
clf = QualityClassifier()
clf.train(train_texts, train_labels)
# 过滤新数据
new_texts = [
"详细的技术文档,包含代码示例和原理解释",
"不错",
"全面介绍了机器学习的基础概念和应用场景"
]
filtered, scores = clf.filter_by_quality(new_texts, threshold=0.7)
第四部分:数据增强
增强策略1:回译(Back Translation)
原理:文本 → 翻译到另一语言 → 翻译回原语言
from transformers import MarianMTModel, MarianTokenizer
class BackTranslator:
"""回译数据增强"""
def __init__(self, src_lang='zh', tgt_lang='en'):
# 中文 → 英文
self.forward_model_name = f'Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}'
self.forward_tokenizer = MarianTokenizer.from_pretrained(self.forward_model_name)
self.forward_model = MarianMTModel.from_pretrained(self.forward_model_name)
# 英文 → 中文
self.backward_model_name = f'Helsinki-NLP/opus-mt-{tgt_lang}-{src_lang}'
self.backward_tokenizer = MarianTokenizer.from_pretrained(self.backward_model_name)
self.backward_model = MarianMTModel.from_pretrained(self.backward_model_name)
def translate(self, text, model, tokenizer):
"""翻译文本"""
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
result = tokenizer.decode(translated[0], skip_special_tokens=True)
return result
def back_translate(self, text):
"""回译"""
# 步骤1:中文 → 英文
en_text = self.translate(text, self.forward_model, self.forward_tokenizer)
# 步骤2:英文 → 中文
zh_text = self.translate(en_text, self.backward_model, self.backward_tokenizer)
return zh_text, en_text
def augment_dataset(self, texts, num_augments=1):
"""增强数据集"""
augmented = []
for text in texts:
# 保留原文
augmented.append(text)
# 生成增强样本
for _ in range(num_augments):
aug_text, _ = self.back_translate(text)
augmented.append(aug_text)
return augmented
# 使用(需要较长时间下载模型)
# bt = BackTranslator()
#
# original = "今天天气很好"
# augmented, intermediate = bt.back_translate(original)
#
# print(f"原文: {original}")
# print(f"中间(英文): {intermediate}")
# print(f"回译: {augmented}")
# 简化版本(不使用模型)
def simple_back_translate(text):
"""简化的回译模拟"""
# 同义词替换
replacements = {
'很': ['非常', '特别', '十分'],
'好': ['不错', '棒', '优秀'],
'天气': ['气候', '天色'],
}
import random
augmented = text
for old, new_list in replacements.items():
if old in augmented:
augmented = augmented.replace(old, random.choice(new_list))
return augmented
# 测试简化版
original_texts = [
"今天天气很好",
"这个方法很有效"
]
for text in original_texts:
aug = simple_back_translate(text)
print(f"原文: {text}")
print(f"增强: {aug}\n")
增强策略2:同义词替换
import random
class SynonymAugmentor:
"""同义词替换增强"""
def __init__(self):
# 简化的同义词词典
self.synonyms = {
'好': ['不错', '很棒', '优秀', '出色'],
'坏': ['糟糕', '差劲', '恶劣'],
'大': ['巨大', '庞大', '宏大'],
'小': ['微小', '细小', '渺小'],
'快': ['迅速', '快速', '敏捷'],
'慢': ['缓慢', '迟缓'],
'方法': ['方式', '途径', '手段'],
'问题': ['疑问', '难题', '课题'],
}
def replace_with_synonym(self, text, replace_ratio=0.3):
"""用同义词替换文本中的词"""
words = list(text)
# 找出可替换的词
replaceable = []
for i, word in enumerate(words):
if word in self.synonyms:
replaceable.append(i)
# 随机选择一部分进行替换
num_replace = max(1, int(len(replaceable) * replace_ratio))
if replaceable:
indices = random.sample(replaceable, min(num_replace, len(replaceable)))
for idx in indices:
word = words[idx]
words[idx] = random.choice(self.synonyms[word])
return ''.join(words)
def augment(self, texts, num_augments=2):
"""增强数据集"""
augmented = []
for text in texts:
augmented.append(text) # 保留原文
for _ in range(num_augments):
aug_text = self.replace_with_synonym(text)
augmented.append(aug_text)
return augmented
# 使用
augmentor = SynonymAugmentor()
original_texts = [
"这是一个很好的方法",
"解决这个问题很快"
]
augmented = augmentor.augment(original_texts, num_augments=2)
print("增强后的数据集:")
for i, text in enumerate(augmented):
print(f"{i+1}. {text}")
增强策略3:合成数据
class SyntheticDataGenerator:
"""合成数据生成器"""
def __init__(self, model, tokenizer):
self.model = model
self.tokenizer = tokenizer
def generate_from_prompt(self, prompt, num_samples=5, max_length=100):
"""从提示生成数据"""
generated = []
for _ in range(num_samples):
inputs = self.tokenizer(prompt, return_tensors='pt')
outputs = self.model.generate(
**inputs,
max_length=max_length,
num_return_sequences=1,
temperature=0.8,
top_p=0.9,
do_sample=True
)
text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
generated.append(text)
return generated
def generate_qa_pairs(self, context, num_pairs=5):
"""基于上下文生成问答对"""
prompt = f"""基于以下文本,生成{num_pairs}个问答对:
文本:{context}
问答对:
"""
generated = self.generate_from_prompt(prompt, num_samples=1)
# 解析生成的问答对(简化版)
qa_pairs = []
# ... 解析逻辑
return qa_pairs
# 模板化合成数据
class TemplateGenerator:
"""基于模板的数据生成"""
def __init__(self):
self.templates = {
'math': [
"{a} + {b} = {result}",
"{a} - {b} = {result}",
"{a} × {b} = {result}",
],
'comparison': [
"{item1} 比 {item2} {attribute}",
"{item1} 和 {item2} 相比,更 {attribute}",
]
}
def generate_math_data(self, num_samples=100):
"""生成数学问题"""
import random
data = []
for _ in range(num_samples):
a = random.randint(1, 100)
b = random.randint(1, 100)
op = random.choice(['+', '-', '×'])
if op == '+':
result = a + b
question = f"{a} + {b} = ?"
elif op == '-':
result = a - b
question = f"{a} - {b} = ?"
else:
result = a * b
question = f"{a} × {b} = ?"
data.append({
'input': question,
'output': str(result)
})
return data
# 使用
gen = TemplateGenerator()
math_data = gen.generate_math_data(num_samples=10)
print("合成的数学数据:")
for i, item in enumerate(math_data[:5]):
print(f"{i+1}. {item['input']} → {item['output']}")
第五部分:工程实践
数据处理Pipeline
class DataProcessingPipeline:
"""完整的数据处理流水线"""
def __init__(self):
self.formatter = DataFormatter()
self.noise_filter = NoiseFilter()
self.quality_classifier = QualityClassifier()
self.deduplicator = FuzzyDeduplicator()
self.anomaly_detector = AnomalyDetector()
self.stats = {
'original_count': 0,
'after_format': 0,
'after_noise': 0,
'after_dedup': 0,
'after_quality': 0,
'after_anomaly': 0,
'final_count': 0
}
def process(self, dataset, config=None):
"""
处理数据集
Args:
dataset: 原始数据集
config: 配置参数
"""
self.stats['original_count'] = len(dataset)
print(f"原始数据: {len(dataset)} 条\n")
# 步骤1:格式标准化
print("步骤1: 格式标准化...")
dataset = [self.formatter.normalize_sample(s) for s in dataset]
self.stats['after_format'] = len(dataset)
print(f" 完成,保留 {len(dataset)} 条\n")
# 步骤2:噪声过滤
print("步骤2: 噪声过滤...")
for sample in dataset:
if 'output' in sample:
sample['output'] = self.noise_filter.filter_noise(sample['output'])
self.stats['after_noise'] = len(dataset)
print(f" 完成\n")
# 步骤3:去重
print("步骤3: 去重...")
texts = [s.get('output', '') for s in dataset]
unique_texts, unique_indices = self.deduplicator.deduplicate(texts)
dataset = [dataset[i] for i in unique_indices]
self.stats['after_dedup'] = len(dataset)
print(f" 完成,保留 {len(dataset)} 条\n")
# 步骤4:质量过滤
if config and config.get('quality_filter', False):
print("步骤4: 质量过滤...")
texts = [s.get('output', '') for s in dataset]
quality_scores = self.quality_classifier.predict(texts)
threshold = config.get('quality_threshold', 0.5)
dataset = [
s for s, score in zip(dataset, quality_scores)
if score >= threshold
]
self.stats['after_quality'] = len(dataset)
print(f" 完成,保留 {len(dataset)} 条\n")
# 步骤5:异常检测
print("步骤5: 异常检测...")
texts = [s.get('output', '') for s in dataset]
anomalies = self.anomaly_detector.detect_all(texts)
anomaly_indices = set(a['index'] for a in anomalies)
dataset = [
s for i, s in enumerate(dataset)
if i not in anomaly_indices
]
self.stats['after_anomaly'] = len(dataset)
print(f" 完成,移除 {len(anomaly_indices)} 个异常,保留 {len(dataset)} 条\n")
self.stats['final_count'] = len(dataset)
# 打印统计
self.print_stats()
return dataset
def print_stats(self):
"""打印处理统计"""
print("="*50)
print("数据处理统计")
print("="*50)
original = self.stats['original_count']
steps = [
('原始数据', 'original_count'),
('格式标准化后', 'after_format'),
('噪声过滤后', 'after_noise'),
('去重后', 'after_dedup'),
('质量过滤后', 'after_quality'),
('异常检测后', 'after_anomaly'),
]
for name, key in steps:
count = self.stats.get(key, 0)
if count > 0:
ratio = count / original * 100
print(f"{name:12s}: {count:6d} ({ratio:5.1f}%)")
final = self.stats['final_count']
removed = original - final
print(f"\n总计移除: {removed} ({removed/original*100:.1f}%)")
print(f"最终数据: {final} ({final/original*100:.1f}%)")
print("="*50)
# 使用
pipeline = DataProcessingPipeline()
raw_dataset = [
# ... 原始数据
]
config = {
'quality_filter': True,
'quality_threshold': 0.7
}
processed_dataset = pipeline.process(raw_dataset, config)
数据质量监控
class DataQualityMonitor:
"""数据质量监控"""
def __init__(self):
self.metrics = {}
def compute_metrics(self, dataset):
"""计算质量指标"""
texts = [s.get('output', '') for s in dataset]
# 1. 基本统计
lengths = [len(t) for t in texts]
self.metrics['count'] = len(dataset)
self.metrics['avg_length'] = np.mean(lengths)
self.metrics['std_length'] = np.std(lengths)
self.metrics['min_length'] = np.min(lengths)
self.metrics['max_length'] = np.max(lengths)
# 2. 多样性
diversity = measure_diversity(texts)
self.metrics.update(diversity)
# 3. 毒性率
detector = ToxicityDetector()
toxic_count = sum(1 for t in texts if detector.detect(t)['is_toxic'])
self.metrics['toxicity_rate'] = toxic_count / len(texts)
# 4. 完整性
complete_count = sum(
1 for s in dataset
if all(s.get(f, '') for f in ['instruction', 'input', 'output'])
)
self.metrics['completeness_rate'] = complete_count / len(dataset)
return self.metrics
def report(self):
"""生成质量报告"""
print("\n" + "="*60)
print("数据质量报告")
print("="*60)
print(f"\n基本统计:")
print(f" 样本数量: {self.metrics['count']}")
print(f" 平均长度: {self.metrics['avg_length']:.1f} 字符")
print(f" 长度标准差: {self.metrics['std_length']:.1f}")
print(f" 长度范围: [{self.metrics['min_length']}, {self.metrics['max_length']}]")
print(f"\n多样性:")
print(f" 词汇多样性: {self.metrics['lexical_diversity']:.3f}")
print(f" 语义多样性: {self.metrics['semantic_diversity']:.3f}")
print(f" 词汇量: {self.metrics['vocabulary_size']}")
print(f"\n质量指标:")
print(f" 毒性率: {self.metrics['toxicity_rate']*100:.2f}%")
print(f" 完整率: {self.metrics['completeness_rate']*100:.2f}%")
# 评分
score = self.compute_quality_score()
print(f"\n综合质量分数: {score:.1f}/100")
if score >= 80:
grade = "优秀 ✅"
elif score >= 60:
grade = "良好"
else:
grade = "需要改进 ⚠️"
print(f"质量等级: {grade}")
print("="*60)
def compute_quality_score(self):
"""计算综合质量分数(0-100)"""
score = 0
# 多样性(40分)
score += self.metrics['lexical_diversity'] * 20
score += self.metrics['semantic_diversity'] * 20
# 完整性(30分)
score += self.metrics['completeness_rate'] * 30
# 安全性(30分)
score += (1 - self.metrics['toxicity_rate']) * 30
return min(100, score)
# 使用
monitor = DataQualityMonitor()
metrics = monitor.compute_metrics(processed_dataset)
monitor.report()
小结
核心要点
1. 数据质量的重要性
- 质量 > 数量
- 高质量数据可以减少训练成本50%,提升性能25-130%
- "Garbage in, garbage out"
2. 六个质量维度
准确性 → 内容正确
完整性 → 信息充分
一致性 → 格式统一
多样性 → 覆盖广泛
相关性 → 符合目标
安全性 → 无害无毒
3. 核心处理技术
清洗:
- 格式标准化
- 噪声过滤
- 去重(精确/模糊/语义)
- 异常检测
过滤:
- 基于规则
- 基于困惑度
- 基于分类器
- 毒性检测
增强:
- 回译
- 同义词替换
- 合成数据
4. 工程实践
完整的数据处理Pipeline:
原始数据
↓
格式标准化
↓
噪声过滤
↓
去重
↓
质量过滤
↓
异常检测
↓
质量监控
↓
高质量数据
实践建议
1. 分阶段处理
阶段1: 快速清洗(去除明显噪声)
→ 移除HTML、URL、特殊字符
→ 格式标准化
→ 精确去重
阶段2: 深度过滤(提升质量)
→ 模糊去重
→ 困惑度过滤
→ 质量分类
阶段3: 精细优化(针对性处理)
→ 领域相关性过滤
→ 毒性检测
→ 数据增强
2. 质量优先级
高优先级(必须做):
✅ 去重(避免记忆)
✅ 毒性检测(安全)
✅ 格式标准化(一致性)
中优先级(推荐做):
⭕ 困惑度过滤(质量)
⭕ 相关性过滤(效率)
⭕ 长度过滤(稳定性)
低优先级(可选):
○ 数据增强(当数据不足时)
○ 复杂的语义分析
3. 监控指标
必监控:
- 样本数量变化
- 平均长度
- 去重率
推荐监控:
- 词汇多样性
- 语义多样性
- 毒性率
- 完整率
高级监控:
- 困惑度分布
- 质量分数分布
- 各维度详细指标
4. 常见陷阱
❌ 过度过滤:
问题:过滤太严格,数据量大幅减少
后果:模型训练不充分
建议:先宽松过滤,逐步调整阈值
❌ 忽略多样性:
问题:只关注质量,忽略多样性
后果:模型泛化能力差
建议:平衡质量和多样性
❌ 一次性处理:
问题:一次性应用所有过滤规则
后果:难以定位问题
建议:分步处理,每步验证
质量检查清单
处理前:
- 了解数据来源和特点
- 定义质量标准
- 设计处理pipeline
- 准备验证数据集
处理中:
- 每步记录数据量变化
- 采样检查中间结果
- 监控关键质量指标
- 保存处理日志
处理后:
- 生成质量报告
- 人工抽查样本
- 对比前后效果
- 评估训练效果
记住:数据质量工程是一个迭代过程,需要不断优化和调整!