PySpark 3.4.1: the spark.ml ALS Algorithm



The ALS Algorithm

ALS stands for Alternating Least Squares.

I. Definition and Principle

  1. Definition: Alternating Least Squares (ALS) is an optimization algorithm used mainly to fit recommender-system models.
  2. Principle: ALS iteratively solves a series of least-squares regression problems. It approximates the user-rating matrix as the product of two low-rank matrices, which represent the latent feature vectors of users and items respectively. In each iteration, ALS first fixes one matrix (say, the user-feature matrix) and solves for the optimal value of the other (the item-feature matrix); it then fixes the item-feature matrix and solves for the user-feature matrix. This alternation repeats until a convergence criterion is met.
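The alternating update above can be sketched in a few lines of NumPy. This is a minimal illustration, not spark.ml's implementation: the toy rating matrix `R`, the rank `k`, the iteration count, and the regularization weight `lam` are all made-up values, and the sketch treats every entry of `R` as observed (real recommender data is sparse).

```python
import numpy as np

def als(R, k=2, iters=30, lam=0.1, seed=0):
    """Approximate R (m x n) as U @ V.T, with U (m x k) and V (n x k)."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        # Fix V: every row of U is the solution of a k-dimensional ridge regression.
        U = np.linalg.solve(V.T @ V + reg, V.T @ R.T).T
        # Fix U: symmetrically solve for every row of V.
        V = np.linalg.solve(U.T @ U + reg, U.T @ R).T
    return U, V

# A toy 4-users x 4-items rating matrix (every entry observed).
R = np.array([[5., 4., 1., 1.],
              [4., 5., 1., 2.],
              [1., 1., 5., 4.],
              [1., 2., 4., 5.]])
U, V = als(R)
print(np.round(U @ V.T, 1))  # reconstruction close to R for this nearly rank-2 matrix
```

Each half-step is a closed-form ridge regression, which is why the regularized objective decreases monotonically as the two factors are updated in turn.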

II. Applications and Characteristics

  1. Applications: In spark.ml, ALS is widely used in recommender systems to predict a user's rating of, or preference for, an item, and thus to recommend items the user is likely to be interested in.

  2. Characteristics

    • ALS is a User-Item collaborative filtering method: it models users and items simultaneously, so recommendations can be made from either the user side or the item side.
    • ALS optimizes the model parameters by iteratively solving least-squares problems, giving it good convergence behavior and accuracy.
    • ALS parallelizes well, so it can handle large-scale datasets efficiently.

III. Parameters and Tuning

  1. Parameters: The main ALS parameters are the number of iterations (maxIter), the regularization parameter (regParam), and the dimensionality of the latent feature vectors (rank). Choosing these well has a significant impact on model performance and accuracy.

    • Number of iterations: More iterations usually yield a more accurate model but also longer training time, so a balance must be struck between accuracy and compute cost.
    • Regularization parameter: Regularization prevents overfitting. A larger value improves generalization but risks underfitting; a smaller value risks overfitting.
    • Latent feature dimensionality: The rank determines how expressive the user and item latent vectors are. Too low a rank loses information; too high a rank makes the model overly complex and hard to train.
  2. Tuning: The following measures can improve ALS performance and accuracy:

    • Choose appropriate parameters: Pick the number of iterations, regularization parameter, and rank based on the characteristics of the dataset and the application scenario.
    • Preprocess the data: Deduplicate, fill in missing values, and otherwise clean the raw data to improve its quality.
    • Parallelize the computation: Use Spark's parallel execution to speed up ALS training.

In summary, spark.ml's ALS is a collaborative filtering algorithm based on alternating least squares, with broad applicability and strong performance. Sensible parameter settings and the tuning measures above can further improve its accuracy.

# -*- coding: utf-8 -*-  
# @Time : 2024/12/25 11:11  
# @Author   : pblh123@126.com  
# @File : pyspark_logisticRegressionCrossValidator.py
# @Describe : Cross-validation for a logistic regression classifier
import warnings  
  
from pyspark.ml import Pipeline  
from pyspark.ml.classification import LogisticRegression  
from pyspark.ml.evaluation import MulticlassClassificationEvaluator  
from pyspark.ml.feature import VectorAssembler, StringIndexer, VectorIndexer, IndexToString  
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator  
from pyspark.sql import SparkSession  
from pyspark.sql.types import DoubleType  
  
from utils.window_Utils import windows_enviroment_set  
  
# Suppress warning messages
warnings.simplefilter("ignore")  
  
windows_enviroment_set()  
  
def main():  
    # Read the data
    path = r"D:\PycharmProjects\2024\pyspark\datas\iris.txt"  
    df_raw = spark.read.option("inferSchema", "true").csv(path).toDF("c0", "c1", "c2", "c3", "label")  
    df_raw.show(5)  
    # Cast the feature columns to Double
    df_double = df_raw.select(  
        df_raw["c0"].cast(DoubleType()),  
        df_raw["c1"].cast(DoubleType()),  
        df_raw["c2"].cast(DoubleType()),  
        df_raw["c3"].cast(DoubleType()),  
        df_raw["label"]  
    )  
  
    # Assemble the feature vector
    assembler = VectorAssembler(inputCols=["c0", "c1", "c2", "c3"], outputCol="features")  
    data = assembler.transform(df_double).select("features", "label")  
    # Split into training and test sets
    trainingData, testData = data.randomSplit([0.7, 0.3])  
    # Index the string labels
    labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)  
    # Index the feature vector
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures").fit(data)  
  
    # Create the logistic regression model
    lr = LogisticRegression(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=50)  
  
    # Convert predictions back to the original string labels
    labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)  
    # Build the Pipeline
    lrPipeline = Pipeline(stages=[labelIndexer, featureIndexer, lr, labelConverter])  
  
    # Build the parameter grid
    paramGrid = ParamGridBuilder() \  
        .addGrid(lr.elasticNetParam, [0.2, 0.8]) \  
        .addGrid(lr.regParam, [0.01, 0.1, 0.5]) \  
        .build()  
    # Create the cross-validator
    cv = CrossValidator(estimator=lrPipeline, estimatorParamMaps=paramGrid,  
                        evaluator=MulticlassClassificationEvaluator(labelCol="indexedLabel",  
                                                                    predictionCol="prediction"), numFolds=3)  
  
    # Run cross-validation on the training data
    cvModel = cv.fit(trainingData)  
  
    # Predict on the test set with the cross-validated model
    lrPredictions = cvModel.transform(testData)  
  
    # Show the first 5 predictions
    lrPredictions.select("predictedLabel", "label", "features", "probability").show(5)  
  
    # Iterate over and print 5 predictions
    for row in lrPredictions.select("predictedLabel", "label", "features", "probability").collect()[:5]:  
        predictedLabel = row["predictedLabel"]  
        label = row["label"]  
        features = row["features"]  
        prob = row["probability"]  
        print(f"({label}, {features}) --> prob={prob}, predicted Label={predictedLabel}")  
  
    # Create a multiclass classification evaluator
    evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction")  
  
    # Compute the model's accuracy
    lrAccuracy = evaluator.evaluate(lrPredictions)  
    print("Model Accuracy:", lrAccuracy)  
  
    # Get the best model found by cross-validation
    bestModel = cvModel.bestModel  
  
    # Print the model's coefficients, intercept, number of classes and number of features
    print("Coefficients: " + str(bestModel.stages[2].coefficientMatrix))  
    print("Intercept: " + str(bestModel.stages[2].interceptVector))  
    print("numClasses: " + str(bestModel.stages[2].numClasses))  
    print("numFeatures: " + str(bestModel.stages[2].numFeatures))  
    # Explain the model's regParam parameter
    print(bestModel.stages[2].explainParam("regParam"))  
    # Explain the model's elasticNetParam parameter
    print(bestModel.stages[2].explainParam("elasticNetParam"))  
  
if __name__ == '__main__':  
    # 1. Create the SparkSession
    spark = SparkSession.builder \  
        .appName("Pysparkmllibals_spark341") \  
        .master("local[2]") \  
        .getOrCreate()  
    # SparkContext
    sc = spark.sparkContext  
    spark.sparkContext.setLogLevel("WARN")  
  
    # 2. Spark business code
    main()
  
    # 3. Stop the SparkContext and SparkSession
    sc.stop()  
    spark.stop()

Run Log

D:\PycharmProjects\2024\pyspark.venv\Scripts\python.exe D:\PycharmProjects\2024\pyspark\chapter9\logisticRegressionCrossValidator.py 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+---+---+---+---+-----------+
| c0| c1| c2| c3|      label|
+---+---+---+---+-----------+
|5.1|3.5|1.4|0.2|Iris-setosa|
|4.9|3.0|1.4|0.2|Iris-setosa|
|4.7|3.2|1.3|0.2|Iris-setosa|
|4.6|3.1|1.5|0.2|Iris-setosa|
|5.0|3.6|1.4|0.2|Iris-setosa|
+---+---+---+---+-----------+
only showing top 5 rows

24/12/25 11:20:22 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
+---------------+---------------+-----------------+--------------------+
| predictedLabel|          label|         features|         probability|
+---------------+---------------+-----------------+--------------------+
|    Iris-setosa|    Iris-setosa|[4.6,3.2,1.4,0.2]|[0.97567608016892...|
|    Iris-setosa|    Iris-setosa|[4.6,3.4,1.4,0.3]|[0.98408104445436...|
|    Iris-setosa|    Iris-setosa|[4.8,3.0,1.4,0.3]|[0.93557770311776...|
|    Iris-setosa|    Iris-setosa|[4.9,3.1,1.5,0.1]|[0.95163310017384...|
|Iris-versicolor|Iris-versicolor|[5.0,2.0,3.5,1.0]|[0.03396547744172...|
+---------------+---------------+-----------------+--------------------+
only showing top 5 rows

(Iris-setosa, [4.6,3.2,1.4,0.2]) --> prob=[0.9756760801689209,0.02432345802265153,4.6180842767602953e-07], predicted Label=Iris-setosa
(Iris-setosa, [4.6,3.4,1.4,0.3]) --> prob=[0.9840810444543696,0.015918606910044902,3.486355855680056e-07], predicted Label=Iris-setosa
(Iris-setosa, [4.8,3.0,1.4,0.3]) --> prob=[0.9355777031177671,0.06441993618513617,2.3606970966376707e-06], predicted Label=Iris-setosa
(Iris-setosa, [4.9,3.1,1.5,0.1]) --> prob=[0.9516331001738487,0.04836613281033871,7.670158125827126e-07], predicted Label=Iris-setosa
(Iris-versicolor, [5.0,2.0,3.5,1.0]) --> prob=[0.033965477441725365,0.928678260536534,0.03735626202174075], predicted Label=Iris-versicolor
Model Accuracy: 0.9310344827586208
Coefficients: DenseMatrix([[-1.10941617,  2.52972276, -0.99886338, -2.15963299],
             [ 0.46649798, -0.3035165 , -0.09557151, -0.81845622],
             [ 0.34783758, -1.64972306,  1.23489926,  3.30225318]])
Intercept: [5.1732165093512155,1.7658436254105918,-6.939060134761807]
numClasses: 3
numFeatures: 4
regParam: regularization parameter (>= 0). (default: 0.0, current: 0.01)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0, current: 0.2)

Process finished with exit code 0