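The Apriori Algorithm Explained
Key terms: support, confidence, frequent itemset, association rule.
An itemset whose support meets the minimum threshold is a frequent itemset. Frequent itemsets come in every size: frequent 1-itemsets, frequent 2-itemsets, frequent 3-itemsets, and so on.
Finally, association rules are mined from the frequent itemsets.
1. Algorithm definition
Apriori is a classic association rule mining algorithm for discovering interesting relationships between items in large datasets (the "beer and diapers" pattern from market basket analysis is the usual example). Its core goal is to find item combinations that occur frequently (frequent itemsets) and to generate association rules from them.
2. Core concepts
- Support: the fraction of transactions that contain an itemset
- Confidence: for a rule A→B, the fraction of transactions containing A that also contain B, i.e. support(A∪B) / support(A)
- Frequent itemset: an itemset whose support meets the minimum support threshold
- Association rule: a rule of the form A→B that meets both the minimum support and the minimum confidence thresholds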
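The two measures above can be sketched in a few lines. This is a minimal illustration rather than part of the implementation that follows: the toy transactions and the helper names `count`, `support`, and `confidence` are made up for this example.

```python
# Toy transactions, made up for this illustration
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent.

    Equal to support(A ∪ B) / support(A); computed from raw counts here
    so the result does not pick up floating-point noise.
    """
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)


print(support({"A", "B"}, transactions))       # 0.6  (3 of 5 transactions)
print(confidence({"A"}, {"B"}, transactions))  # 0.75 (3 of the 4 containing A)
```

With a minimum support of 0.5 and a minimum confidence of 0.7, the rule A→B would pass both thresholds in this toy data.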
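3. How the algorithm works
Apriori relies on the Apriori property: "every non-empty subset of a frequent itemset must also be frequent." Equivalently, any itemset that has an infrequent subset cannot itself be frequent. The algorithm uses this to find frequent itemsets efficiently with a level-wise search, one itemset size at a time:
- Join step: combine frequent (k-1)-itemsets to generate candidate k-itemsets
- Prune step: use the Apriori property to discard any candidate that contains an infrequent (k-1)-subset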
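The join and prune steps can be tried in isolation on a small input. The sketch below assumes a hypothetical set of frequent 2-itemsets over items A to D. Its `join` simply unions pairs and keeps unions of size k, which is a simpler variant than the sorted-prefix join used in the implementation later in this article, and the function names are illustrative.

```python
from itertools import combinations


def join(frequent_prev, k):
    """Join step: union pairs of (k-1)-itemsets, keeping unions of exactly k items."""
    candidates = set()
    for a, b in combinations(frequent_prev, 2):
        union = a | b
        if len(union) == k:
            candidates.add(union)
    return candidates


def prune(candidates, frequent_prev, k):
    """Prune step: keep only candidates whose (k-1)-subsets are all frequent."""
    return {
        c for c in candidates
        if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))
    }


# Hypothetical frequent 2-itemsets; {A, D} and {C, D} are assumed infrequent
l2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}

c3 = join(l2, 3)
kept = prune(c3, l2, 3)

print(sorted(sorted(c) for c in c3))    # [['A', 'B', 'C'], ['A', 'B', 'D'], ['B', 'C', 'D']]
print(sorted(sorted(c) for c in kept))  # [['A', 'B', 'C']]
```

Here {A, B, D} is dropped because {A, D} is not frequent, and {B, C, D} because {C, D} is not, so only one candidate has to be counted against the dataset.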
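4. Algorithm steps
- Scan the dataset and count the support of every 1-itemset
- Keep the 1-itemsets that meet the minimum support; these are the frequent 1-itemsets L₁
- Repeat until no new frequent itemsets are found:
  - Generate the candidate (k+1)-itemsets Cₖ₊₁ from Lₖ
  - Scan the dataset to compute the support of each candidate in Cₖ₊₁
  - Keep the candidates that meet the minimum support to obtain Lₖ₊₁
- Generate association rules from all the frequent itemsets
Python implementation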
from itertools import combinations


def load_dataset():
    """Create a sample dataset of market-basket transactions."""
    # Item names are kept as in the original data:
    # 牛奶 = milk, 面包 = bread, 鸡蛋 = eggs, 啤酒 = beer, 薯片 = chips
    return [
        ['牛奶', '面包', '鸡蛋'],
        ['牛奶', '面包', '薯片', '啤酒'],
        ['面包', '鸡蛋', '啤酒'],
        ['牛奶', '鸡蛋', '啤酒'],
        ['面包', '牛奶', '鸡蛋'],
        ['面包', '牛奶', '鸡蛋', '啤酒'],
        ['牛奶', '鸡蛋'],
        ['面包', '鸡蛋'],
        ['面包', '牛奶', '鸡蛋'],
        ['面包', '牛奶', '啤酒']
    ]
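
def create_c1(dataset):
    """Build the initial candidate set C1 (every distinct 1-itemset)."""
    c1 = []
    for transaction in dataset:
        for item in transaction:
            if [item] not in c1:
                c1.append([item])
    c1.sort()
    # frozenset is hashable, so itemsets can be used as dictionary keys
    return list(map(frozenset, c1))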

def scan_dataset(dataset, candidates, min_support):
    """Scan the dataset and keep the candidates that meet the minimum support."""
    item_count = {}
    for transaction in dataset:
        for candidate in candidates:
            if candidate.issubset(transaction):
                item_count[candidate] = item_count.get(candidate, 0) + 1
    num_transactions = float(len(dataset))
    frequent_items = []
    support_data = {}
    for item, count in item_count.items():
        support = count / num_transactions
        if support >= min_support:
            frequent_items.append(item)
            # Only frequent itemsets are recorded in support_data
            support_data[item] = support
    return frequent_items, support_data

def apriori_gen(frequent_items, k):
    """Generate the candidate k-itemsets Ck from L(k-1) (join + prune)."""
    candidates = []
    len_fk = len(frequent_items)
    for i in range(len_fk):
        for j in range(i + 1, len_fk):
            # Join step: merge two (k-1)-itemsets that share their first k-2 items
            itemset_i = sorted(frequent_items[i])
            itemset_j = sorted(frequent_items[j])
            if itemset_i[:k-2] == itemset_j[:k-2]:
                new_candidate = frequent_items[i] | frequent_items[j]
                # Prune step: drop candidates that have an infrequent (k-1)-subset
                if not has_infrequent_subset(new_candidate, frequent_items, k):
                    candidates.append(new_candidate)
    return candidates

def has_infrequent_subset(candidate, frequent_items, k):
    """Return True if any (k-1)-subset of the candidate is not frequent."""
    for subset in combinations(candidate, k - 1):
        if frozenset(subset) not in frequent_items:
            return True
    return False
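
def apriori(dataset, min_support=0.3):
    """Run Apriori. Returns the frequent itemsets grouped by size and their supports."""
    # Initialization
    c1 = create_c1(dataset)
    dataset = list(map(set, dataset))
    # Build L1
    l1, support_data = scan_dataset(dataset, c1, min_support)
    frequent_items = [l1]
    k = 2
    # Build larger itemsets level by level until a level comes back empty
    while frequent_items[k-2]:
        ck = apriori_gen(frequent_items[k-2], k)
        lk, sup_k = scan_dataset(dataset, ck, min_support)
        support_data.update(sup_k)
        frequent_items.append(lk)
        k += 1
    # Note: the last entry of frequent_items is the empty level that ended the loop
    return frequent_items, support_data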
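
def generate_rules(frequent_items, support_data, min_confidence=0.7):
    """Generate association rules from every frequent itemset of size 2 or more."""
    rules = []
    # Index 0 holds the 1-itemsets, which cannot be split into a rule
    for i in range(1, len(frequent_items)):
        for itemset in frequent_items[i]:
            # Start with every single item as a candidate consequent
            subsets = [frozenset([item]) for item in itemset]
            if len(itemset) > 1:
                rules_from_itemset(itemset, subsets, support_data, rules, min_confidence)
    return rules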
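
def rules_from_itemset(itemset, subsets, support_data, rules, min_confidence):
    """Generate rules from one itemset, trying each subset as the consequent."""
    k = len(itemset)
    if k < 2 or not subsets:
        return
    m = len(subsets[0])  # size of the consequents at this level
    for subset in subsets:
        antecedent = itemset - subset
        consequent = subset
        calculate_confidence(antecedent, consequent, support_data, rules, min_confidence)
    # Recurse on larger consequents only while the antecedent stays non-empty;
    # without this bound the recursion would never terminate
    if k > m + 1:
        new_subsets = []
        for i in range(len(subsets)):
            for j in range(i + 1, len(subsets)):
                new_subset = subsets[i] | subsets[j]
                # Keep only unions that grow the consequent by exactly one item
                if len(new_subset) == m + 1 and new_subset not in new_subsets:
                    new_subsets.append(new_subset)
        rules_from_itemset(itemset, new_subsets, support_data, rules, min_confidence)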

def calculate_confidence(antecedent, consequent, support_data, rules, min_confidence):
    """Compute a rule's confidence and keep the rule if it meets the threshold."""
    support_both = support_data.get(antecedent | consequent, 0)
    support_antecedent = support_data.get(antecedent, 0)
    if support_antecedent > 0:
        confidence = support_both / support_antecedent
        if confidence >= min_confidence:
            rules.append((antecedent, consequent, confidence))
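
# Main program
if __name__ == "__main__":
    min_support = 0.3
    min_confidence = 0.6
    dataset = load_dataset()
    frequent_items, support_data = apriori(dataset, min_support)
    rules = generate_rules(frequent_items, support_data, min_confidence)
    # Print the frequent itemsets (sorted so the output is reproducible)
    print("Frequent itemsets (min support={}):".format(min_support))
    for i, itemsets in enumerate(frequent_items):
        if itemsets:
            print("{}-itemsets:".format(i + 1))
            for itemset in sorted(itemsets, key=sorted):
                print("  {} : support={:.2f}".format(sorted(itemset), support_data[itemset]))
    # Print the association rules, rules from smaller itemsets first
    rules.sort(key=lambda rule: (len(rule[0] | rule[1]),
                                 sorted(rule[0]), sorted(rule[1])))
    print("\nAssociation rules (min confidence={}):".format(min_confidence))
    for antecedent, consequent, confidence in rules:
        print("{} => {} : confidence={:.2f}".format(
            sorted(antecedent), sorted(consequent), confidence
        ))
Sample output
Frequent itemsets (min support=0.3):
1-itemsets:
  ['啤酒'] : support=0.50
  ['牛奶'] : support=0.80
  ['面包'] : support=0.80
  ['鸡蛋'] : support=0.80
2-itemsets:
  ['啤酒', '牛奶'] : support=0.40
  ['啤酒', '面包'] : support=0.40
  ['啤酒', '鸡蛋'] : support=0.30
  ['牛奶', '面包'] : support=0.60
  ['牛奶', '鸡蛋'] : support=0.60
  ['面包', '鸡蛋'] : support=0.60
3-itemsets:
  ['啤酒', '牛奶', '面包'] : support=0.30
  ['牛奶', '面包', '鸡蛋'] : support=0.40

Association rules (min confidence=0.6):
['啤酒'] => ['牛奶'] : confidence=0.80
['啤酒'] => ['面包'] : confidence=0.80
['啤酒'] => ['鸡蛋'] : confidence=0.60
['牛奶'] => ['面包'] : confidence=0.75
['牛奶'] => ['鸡蛋'] : confidence=0.75
['面包'] => ['牛奶'] : confidence=0.75
['面包'] => ['鸡蛋'] : confidence=0.75
['鸡蛋'] => ['牛奶'] : confidence=0.75
['鸡蛋'] => ['面包'] : confidence=0.75
['啤酒'] => ['牛奶', '面包'] : confidence=0.60
['啤酒', '牛奶'] => ['面包'] : confidence=0.75
['啤酒', '面包'] => ['牛奶'] : confidence=0.75
['牛奶', '面包'] => ['鸡蛋'] : confidence=0.67
['牛奶', '鸡蛋'] => ['面包'] : confidence=0.67
['面包', '鸡蛋'] => ['牛奶'] : confidence=0.67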
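As a quick sanity check, the confidence of one rule can be recomputed by hand from the supports listed above, using confidence(A→B) = support(A∪B) / support(A). In the arithmetic below, "beer" stands for 啤酒 and "bread" for 面包:

```latex
\[
\operatorname{conf}(\{\text{beer}\} \Rightarrow \{\text{bread}\})
  = \frac{\operatorname{supp}(\{\text{beer}, \text{bread}\})}{\operatorname{supp}(\{\text{beer}\})}
  = \frac{0.40}{0.50}
  = 0.80 \ge 0.6
\]
```

So a rule from beer to bread clears the minimum confidence of 0.6 used in this run.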
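Key points
- Efficiency: pruning with the Apriori property removes many candidates before they are counted, which limits the combinatorial explosion of candidate itemsets
- Interpretability: the generated rules have a clear business meaning (for example, shopping recommendations)
- Typical applications:
  - Market basket analysis
  - Cross-selling recommendations
  - Website click-path analysis
  - Pattern discovery in medical diagnosis
Note: in practice, min_support and min_confidence need to be tuned for the data at hand, and for large datasets a more scalable alternative such as the FP-Growth algorithm is worth considering.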
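To get a feel for how sensitive the result is to min_support, the sketch below counts frequent itemsets at several thresholds. It is a brute-force illustration that enumerates every possible itemset, so it only suits tiny inputs. The transactions and the helper name `count_frequent` are made up for this example.

```python
from itertools import combinations

# Hypothetical transactions, made up for this illustration
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]


def count_frequent(transactions, min_support):
    """Brute-force count of the itemsets whose support meets min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    total = 0
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            hits = sum(1 for t in transactions if set(combo) <= t)
            if hits / n >= min_support:
                total += 1
    return total


for threshold in (0.2, 0.4, 0.6, 0.8):
    print(threshold, count_frequent(transactions, threshold))
# 0.2 15
# 0.4 7
# 0.6 6
# 0.8 3
```

Even on five transactions, lowering the threshold from 0.8 to 0.2 raises the number of frequent itemsets from 3 to 15, and every additional frequent itemset also adds candidate rules to evaluate.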