[Association Rules] The Apriori Algorithm (Introduction)


The Apriori Algorithm Explained

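Four terms matter here: support, confidence, frequent itemsets, and association rules.

An itemset that meets the minimum support threshold is a frequent itemset. Depending on its size it is a frequent 1-itemset, a frequent 2-itemset, a frequent 3-itemset, and so on.

Association rules are then mined from the frequent itemsets.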

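1. Definition

Apriori is a classic association rule mining algorithm for discovering interesting relationships between items in large datasets (the "beer and diapers" pattern from market basket analysis is the usual example). Its goal is to find combinations of items that occur together often (frequent itemsets) and to derive association rules from them.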

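2. Core concepts

  • Support: the fraction of transactions that contain an itemset
    $$\text{Support}(A) = \frac{\text{number of transactions containing } A}{\text{total number of transactions}}$$
  • Confidence: for a rule A→B, how often B appears in the transactions that contain A
    $$\text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}$$
  • Frequent itemset: an itemset whose support is at least the minimum support threshold
  • Association rule: a rule of the form A→B, with A and B disjoint itemsets, that meets both the minimum support and the minimum confidence thresholds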
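As a quick numeric check of the two formulas, here is a minimal sketch on four made-up transactions. The `support` and `confidence` helpers and the basket contents are illustrative assumptions, not part of the implementation later in this article.

```python
# Four hypothetical transactions, each stored as a set of items.
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support(A ∪ B) / Support(A) for the rule A -> B."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"milk"}, transactions))                         # 0.75
print(support({"milk", "bread"}, transactions))                # 0.5
print(round(confidence({"milk"}, {"bread"}, transactions), 2))  # 0.67
```

Here milk appears in 3 of 4 transactions and milk with bread in 2 of 4, so the rule milk→bread has confidence 0.5 / 0.75, roughly 0.67.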

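3. How it works

The algorithm rests on the Apriori property: every non-empty subset of a frequent itemset must itself be frequent. Equivalently, an itemset that has an infrequent subset cannot be frequent. This lets a level-wise search find frequent itemsets efficiently by repeating two steps at each level:

  1. Join: combine frequent (k-1)-itemsets to generate candidate k-itemsets
  2. Prune: discard every candidate that contains an infrequent (k-1)-subset, which the Apriori property guarantees cannot be frequent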
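The two steps can be sketched on a toy input. The function `join_and_prune` and the itemsets below are hypothetical; itemsets are represented as sorted tuples here, which is a simplification and not the representation used in the implementation later in this article.

```python
from itertools import combinations

def join_and_prune(frequent_prev):
    """Build candidate k-itemsets from frequent (k-1)-itemsets (sorted tuples)."""
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        # Join: merge two itemsets that agree on everything except the last item.
        if a[:-1] == b[:-1]:
            candidate = a + (b[-1],)
            # Prune: keep the candidate only if every (k-1)-subset is frequent.
            subsets = combinations(candidate, len(candidate) - 1)
            if all(sub in prev_set for sub in subsets):
                candidates.append(candidate)
    return candidates

# Hypothetical frequent 2-itemsets.
l2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(join_and_prune(l2))  # [('A', 'B', 'C')]
```

The join produces both (A, B, C) and (B, C, D), but (B, C, D) is pruned because its subset (C, D) is not among the frequent 2-itemsets, so its support never has to be counted.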

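4. Algorithm steps

  1. Scan the dataset and count the support of every 1-itemset
  2. Keep the 1-itemsets that meet the minimum support; they form the frequent 1-itemsets L₁
  3. Repeat until no new frequent itemsets are found:
    • Generate the candidate (k+1)-itemsets Cₖ₊₁ from Lₖ
    • Scan the dataset to compute the support of each candidate in Cₖ₊₁
    • Keep the candidates that meet the minimum support to obtain Lₖ₊₁
  4. Generate association rules from all frequent itemsets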
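Step 4 can be illustrated on a single frequent itemset: split it into every non-empty antecedent and the remaining consequent, then keep the splits whose confidence is high enough. The sketch below assumes the supports are already known; the `rules_for` helper and the support values are made up for illustration.

```python
from itertools import combinations

# Hypothetical supports for the itemset {A, B, C} and all of its subsets.
support = {
    frozenset("A"): 0.8, frozenset("B"): 0.8, frozenset("C"): 0.5,
    frozenset("AB"): 0.6, frozenset("AC"): 0.4, frozenset("BC"): 0.4,
    frozenset("ABC"): 0.3,
}

def rules_for(itemset, support, min_confidence):
    """Enumerate rules antecedent -> (itemset - antecedent) that pass min_confidence."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for combo in combinations(sorted(itemset), size):
            antecedent = frozenset(combo)
            consequent = itemset - antecedent
            conf = support[itemset] / support[antecedent]
            if conf >= min_confidence:
                rules.append((sorted(antecedent), sorted(consequent), round(conf, 2)))
    return rules

print(rules_for("ABC", support, 0.7))
# [(['A', 'C'], ['B'], 0.75), (['B', 'C'], ['A'], 0.75)]
```

Only the two rules with a two-item antecedent survive: their antecedents have support 0.4, giving confidence 0.3 / 0.4 = 0.75, while every other split falls below 0.7.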

Python implementation

from itertools import combinations

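def load_dataset():
    """Return a small sample of market-basket transactions."""
    # Item names: 牛奶 = milk, 面包 = bread, 鸡蛋 = eggs, 薯片 = chips, 啤酒 = beer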
    return [
        ['牛奶', '面包', '鸡蛋'],
        ['牛奶', '面包', '薯片', '啤酒'],
        ['面包', '鸡蛋', '啤酒'],
        ['牛奶', '鸡蛋', '啤酒'],
        ['面包', '牛奶', '鸡蛋'],
        ['面包', '牛奶', '鸡蛋', '啤酒'],
        ['牛奶', '鸡蛋'],
        ['面包', '鸡蛋'],
        ['面包', '牛奶', '鸡蛋'],
        ['面包', '牛奶', '啤酒']
    ]

def create_c1(dataset):
    """Build the initial candidate set C1 (all 1-itemsets)."""
    c1 = []
    for transaction in dataset:
        for item in transaction:
            if [item] not in c1:
                c1.append([item])
    c1.sort()
    return list(map(frozenset, c1))

def scan_dataset(dataset, candidates, min_support):
    """Scan the dataset and keep the candidates that meet min_support."""
    item_count = {}
    for transaction in dataset:
        for candidate in candidates:
            if candidate.issubset(transaction):
                item_count.setdefault(candidate, 0)
                item_count[candidate] += 1
    
    num_transactions = float(len(dataset))
    frequent_items = []
    support_data = {}
    
    for item in item_count:
        support = item_count[item] / num_transactions
        if support >= min_support:
            frequent_items.append(item)
            support_data[item] = support
    
    return frequent_items, support_data

def apriori_gen(frequent_items, k):
    """Generate candidate k-itemsets Ck from frequent (k-1)-itemsets (join + prune)."""
    candidates = []
    len_fk = len(frequent_items)
    
    for i in range(len_fk):
        for j in range(i+1, len_fk):
            # Join step: merge two itemsets whose first k-2 sorted items match
            itemset_i = sorted(frequent_items[i])
            itemset_j = sorted(frequent_items[j])
            
            if itemset_i[:k-2] == itemset_j[:k-2]:
                new_candidate = frequent_items[i] | frequent_items[j]
                
                # Prune step: drop candidates that have an infrequent (k-1)-subset
                if not has_infrequent_subset(new_candidate, frequent_items, k):
                    candidates.append(new_candidate)
    return candidates

def has_infrequent_subset(candidate, frequent_items, k):
    """Return True if any (k-1)-subset of the candidate is not frequent."""
    subsets = combinations(candidate, k-1)
    for subset in subsets:
        if frozenset(subset) not in frequent_items:
            return True
    return False

def apriori(dataset, min_support=0.3):
    """Run Apriori; return the frequent itemsets grouped by size and their supports."""
    # Initialization
    c1 = create_c1(dataset)
    dataset = list(map(set, dataset))
    
    # Build L1
    l1, support_data = scan_dataset(dataset, c1, min_support)
    frequent_items = [l1]
    k = 2
    
    # Build larger itemsets level by level until a level comes back empty
    while frequent_items[k-2]:
        ck = apriori_gen(frequent_items[k-2], k)
        lk, supK = scan_dataset(dataset, ck, min_support)
        support_data.update(supK)
        frequent_items.append(lk)
        k += 1
    
    # The last element of frequent_items is always an empty list
    return frequent_items, support_data

def generate_rules(frequent_items, support_data, min_confidence=0.7):
    """Generate association rules from every frequent itemset with at least 2 items."""
    rules = []
    for i in range(1, len(frequent_items)):
        for itemset in frequent_items[i]:
            # Start from the single-item consequents
            subsets = [frozenset([item]) for item in itemset]
            if len(itemset) > 1:
                rules_from_itemset(itemset, subsets, support_data, rules, min_confidence)
    return rules

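def rules_from_itemset(itemset, subsets, support_data, rules, min_confidence):
    """Generate rules from one itemset, using each element of subsets as a consequent."""
    k = len(itemset)
    if k > 1 and subsets:
        size = len(subsets[0])
        # Emit one rule per consequent of the current size
        for subset in subsets:
            antecedent = itemset - subset
            consequent = subset
            calculate_confidence(antecedent, consequent, support_data, rules, min_confidence)
        
        # Recurse on consequents one item larger, but only while the antecedent
        # stays non-empty; without this bound the recursion never terminates
        if size + 1 < k:
            new_subsets = []
            for i in range(len(subsets)):
                for j in range(i+1, len(subsets)):
                    new_subset = subsets[i] | subsets[j]
                    if len(new_subset) == size + 1 and new_subset not in new_subsets:
                        new_subsets.append(new_subset)
            rules_from_itemset(itemset, new_subsets, support_data, rules, min_confidence)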

def calculate_confidence(antecedent, consequent, support_data, rules, min_confidence):
    """Compute a rule's confidence and keep the rule if it meets min_confidence."""
    support_both = support_data.get(antecedent | consequent, 0)
    support_antecedent = support_data.get(antecedent, 0)
    
    if support_antecedent > 0:
        confidence = support_both / support_antecedent
        if confidence >= min_confidence:
            rules.append((antecedent, consequent, confidence))

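# Main program
if __name__ == "__main__":
    min_support = 0.3
    min_confidence = 0.6
    
    dataset = load_dataset()
    frequent_items, support_data = apriori(dataset, min_support)
    rules = generate_rules(frequent_items, support_data, min_confidence)
    
    # Print the frequent itemsets
    print("Frequent itemsets (min support={}):".format(min_support))
    for i, itemsets in enumerate(frequent_items):
        if itemsets:
            print("{}-itemsets:".format(i+1))
            for itemset in itemsets:
                print("  {} : support={:.2f}".format(list(itemset), support_data[itemset]))
    
    # Print the association rules
    print("\nAssociation rules (min confidence={}):".format(min_confidence))
    for antecedent, consequent, confidence in rules:
        print("{} => {} : confidence={:.2f}".format(
            list(antecedent), list(consequent), confidence
        ))

Sample output

The order of items inside each set, and the order of rules derived from the same itemset, can vary between runs. Item names: 牛奶 = milk, 面包 = bread, 鸡蛋 = eggs, 啤酒 = beer.

Frequent itemsets (min support=0.3):
1-itemsets:
  ['牛奶'] : support=0.80
  ['面包'] : support=0.80
  ['鸡蛋'] : support=0.80
  ['啤酒'] : support=0.50
2-itemsets:
  ['牛奶', '面包'] : support=0.60
  ['牛奶', '鸡蛋'] : support=0.60
  ['面包', '鸡蛋'] : support=0.60
  ['牛奶', '啤酒'] : support=0.40
  ['面包', '啤酒'] : support=0.40
  ['鸡蛋', '啤酒'] : support=0.30
3-itemsets:
  ['牛奶', '面包', '鸡蛋'] : support=0.40
  ['牛奶', '面包', '啤酒'] : support=0.30

Association rules (min confidence=0.6):
['面包'] => ['牛奶'] : confidence=0.75
['牛奶'] => ['面包'] : confidence=0.75
['鸡蛋'] => ['牛奶'] : confidence=0.75
['牛奶'] => ['鸡蛋'] : confidence=0.75
['鸡蛋'] => ['面包'] : confidence=0.75
['面包'] => ['鸡蛋'] : confidence=0.75
['啤酒'] => ['牛奶'] : confidence=0.80
['啤酒'] => ['面包'] : confidence=0.80
['啤酒'] => ['鸡蛋'] : confidence=0.60
['面包', '鸡蛋'] => ['牛奶'] : confidence=0.67
['牛奶', '鸡蛋'] => ['面包'] : confidence=0.67
['牛奶', '面包'] => ['鸡蛋'] : confidence=0.67
['面包', '啤酒'] => ['牛奶'] : confidence=0.75
['牛奶', '啤酒'] => ['面包'] : confidence=0.75
['啤酒'] => ['牛奶', '面包'] : confidence=0.60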

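Key points

  1. Efficiency: pruning with the Apriori property cuts down the number of candidates whose support must be counted, which curbs the combinatorial explosion of naive enumeration
  2. Interpretability: the generated rules have a direct business reading (for example, shopping recommendations)
  3. Typical applications
    • Market basket analysis
    • Cross-selling recommendations
    • Web page navigation analysis
    • Pattern discovery in medical diagnosis

Note: in practice, the min_support and min_confidence parameters need tuning for the data at hand, and for large datasets a more scalable approach such as the FP-Growth algorithm is worth considering.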
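To get a feel for how sensitive the result is to min_support, the sketch below counts frequent itemsets on the sample dataset at a few thresholds. It deliberately uses brute-force enumeration instead of Apriori, which is only workable because there are five distinct items; the `count_frequent_itemsets` helper is a hypothetical name introduced for this illustration.

```python
from itertools import combinations

# Same transactions as load_dataset() in this article.
dataset = [
    ['牛奶', '面包', '鸡蛋'],
    ['牛奶', '面包', '薯片', '啤酒'],
    ['面包', '鸡蛋', '啤酒'],
    ['牛奶', '鸡蛋', '啤酒'],
    ['面包', '牛奶', '鸡蛋'],
    ['面包', '牛奶', '鸡蛋', '啤酒'],
    ['牛奶', '鸡蛋'],
    ['面包', '鸡蛋'],
    ['面包', '牛奶', '鸡蛋'],
    ['面包', '牛奶', '啤酒'],
]

def count_frequent_itemsets(transactions, min_support):
    """Count itemsets meeting min_support by checking every possible combination."""
    baskets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*baskets))
    total = 0
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            matches = sum(1 for b in baskets if b.issuperset(combo))
            if matches / len(baskets) >= min_support:
                total += 1
    return total

for threshold in (0.3, 0.5, 0.7):
    print(threshold, count_frequent_itemsets(dataset, threshold))
# 0.3 12
# 0.5 7
# 0.7 3
```

Raising the threshold from 0.3 to 0.7 shrinks the result from 12 itemsets to the 3 most common single items, so the threshold largely decides which rules can be found at all.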