Product Sorting Refactor

Background

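Originally, product sorting was handled in a crude way: every matching document was pulled out of ES and then sorted in application code (ES served only as a place to gather the data). As the data volume grew and the rules multiplied, the cost kept rising: the application code became more complex, and ES consumed more memory and network bandwidth. In the most recent stress test, the query bottleneck was object reclamation in the ES JVM. Taking the sorting and optimization requirements together, the goals are to reduce application complexity on the one hand and to cut down the amount of useless data on the other, and the ES scoring mechanism fits both very well.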

Rule 0: Sort by score

{
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "spu_id": {
        "order": "asc"
      }
    }
  ]
}
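
To make the tie-breaking explicit, the ordering requested above can be reproduced in memory. This is only an illustrative sketch; the hit dictionaries are made-up sample data, not real ES responses.

```python
def rule0_order(hits):
    """Order hits by _score descending, then by spu_id ascending as a tie-breaker."""
    return sorted(hits, key=lambda h: (-h["_score"], h["spu_id"]))


# Made-up sample hits: two of them share the same score.
hits = [
    {"spu_id": 3, "_score": 1000.0},
    {"spu_id": 1, "_score": 1000.0},
    {"spu_id": 2, "_score": 3000.0},
]
ordered_ids = [h["spu_id"] for h in rule0_order(hits)]
```

The secondary key on spu_id keeps the order stable across requests when several documents receive the same score.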

Rule 1: Store priority

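When the same product is listed in all three types of store, one listing is chosen by the priority "cloud warehouse (云仓), online store (网店), global shopping (全球购)". The data is fetched with msearch (a single network round trip) and the choice is then made in memory.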

{ "routing": ""} # header
{ "query":{ } } # body
{ "routing": ""} # header
{ "query":{ } } # body
{ "routing": ""} # header
{ "query":{ } } # body
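
A minimal sketch of the two halves of this rule: serializing the header/body pairs into the newline-delimited body that msearch expects, and then picking one listing per product in memory. The store-type labels, the shape of the hits, and the helper names are all assumptions made for illustration.

```python
import json

# Assumed priority order; the keys are hypothetical labels for the three store types.
STORE_PRIORITY = ["cloud_warehouse", "online_store", "global_shopping"]


def build_msearch_body(requests):
    """Serialize (header, body) pairs as NDJSON, one JSON object per line."""
    lines = []
    for header, body in requests:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"


def pick_by_priority(hits_by_store_type):
    """For each spu_id keep the hit from the highest-priority store type."""
    chosen = {}
    for store_type in STORE_PRIORITY:
        for hit in hits_by_store_type.get(store_type, []):
            # setdefault keeps the first (highest-priority) hit seen for a product.
            chosen.setdefault(hit["spu_id"], {**hit, "store_type": store_type})
    return chosen


body = build_msearch_body([
    ({"routing": "r1"}, {"query": {"match_all": {}}}),
    ({"routing": "r2"}, {"query": {"match_all": {}}}),
])

# Made-up hits grouped by store type (hypothetical store ids).
picked = pick_by_priority({
    "online_store": [{"spu_id": 1, "store_id": "s2"}, {"spu_id": 2, "store_id": "s2"}],
    "cloud_warehouse": [{"spu_id": 1, "store_id": "s1"}],
})
```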

Rule 2: Sold-out items sink to the bottom (zero score)

{
  "query": {
    "boosting": {
      "positive": { },
      "negative": {
        "term": {
          "is_sold_out": true
        }
      },
      "negative_boost": 0
    }
  }
}

As shown above, a similar effect can also be achieved with function score, though it is less concise and convenient than boosting.
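
For comparison, one way the function score variant might look is sketched below as a small Python builder. It assumes that a document matching no function keeps its original query score under boost_mode "multiply", and that a weight of 0 is accepted; the helper name is hypothetical and the query has not been verified against a specific ES version.

```python
def sold_out_to_bottom(positive_query, field="is_sold_out"):
    """Build a function_score query that multiplies the score of sold-out docs by 0."""
    return {
        "query": {
            "function_score": {
                "query": positive_query,
                "functions": [
                    # Only documents matching the filter are affected by this weight.
                    {"filter": {"term": {field: True}}, "weight": 0}
                ],
                "score_mode": "multiply",
                "boost_mode": "multiply",
            }
        }
    }


dsl = sold_out_to_bottom({"match_all": {}})
```

The extra nesting of functions, score_mode and boost_mode is what makes this form more verbose than boosting.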

Sorting

1 Field sort: by sales / by price

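In general, using the value of sales / price as the score is enough to produce the ordering. However, a product with sales of 0 and a sold-out product both end up with a score of 0. To resolve this conflict, either let sales start counting from 1 by default, or use function score + painless: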

{
  "query": {
    "function_score": {
      "query": {},
      "functions": [
        {
          "field_value_factor": {
            "field": "sales" # or price
          }
        }
      ]
    }
  }
}

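Because sales or price values can be very large, they can be smoothed (which also leaves score space for the other sort rules), for example by setting modifier to log / ln. Switching modifier to reciprocal reverses the order: since results are sorted by _score descending, a larger field value then ranks lower. Zero values need care (log and ln are undefined at 0, and reciprocal divides by 0); log2p / ln2p or painless can be used to handle them.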
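
The effect of the modifiers can be checked with plain arithmetic. The sketch below mirrors the documented field_value_factor formulas (the modifier is applied to factor * value) in Python; it is an approximation for reasoning about score ranges, not ES code.

```python
import math

# Subset of the field_value_factor modifiers, written as plain functions.
MODIFIERS = {
    "none": lambda v: v,
    "log": lambda v: math.log10(v),
    "log1p": lambda v: math.log10(v + 1),
    "log2p": lambda v: math.log10(v + 2),
    "ln": lambda v: math.log(v),
    "ln1p": lambda v: math.log(v + 1),
    "ln2p": lambda v: math.log(v + 2),
    "reciprocal": lambda v: 1.0 / v,
}


def field_value_score(value, modifier="none", factor=1.0):
    """Score contributed by a field value under the given modifier."""
    return MODIFIERS[modifier](factor * value)


# log2p stays defined and positive at 0, and compresses large values.
zero_sales = field_value_score(0, "log2p")
huge_sales = field_value_score(10**6, "log2p")
```

With log2p, a sales value of one million contributes only about 6 to the score, which is why it leaves room for the other rules.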

2 List sort: by a specified brand order / by a specified category order / by a specified product order

{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "FIELD": "VALUE"
              }
            },
            "boost": 1000
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "FIELD": "VALUE"
              }
            },
            "boost": 2000
          }
        }
      ]
    }
  }
}
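
Writing these clauses by hand gets tedious as the list grows, so they can be generated. A minimal sketch, assuming the first value in the list should rank highest and that a fixed step between boosts is acceptable; the function name and the sample brand values are hypothetical.

```python
def list_sort_query(field, ordered_values, step=1000):
    """Build a bool/should query whose boosts encode the position in ordered_values."""
    n = len(ordered_values)
    should = [
        {
            "constant_score": {
                "filter": {"term": {field: value}},
                # Earlier values get larger boosts: n*step, (n-1)*step, ..., step.
                "boost": (n - i) * step,
            }
        }
        for i, value in enumerate(ordered_values)
    ]
    return {"query": {"bool": {"should": should}}}


dsl = list_sort_query("brand_id", ["b1", "b2", "b3"])
boosts = [c["constant_score"]["boost"] for c in dsl["query"]["bool"]["should"]]
```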

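This generalizes naturally. The default sort, cloud warehouse, online store, global shopping (with express-delivery (极速达) items pinned to the top), is really just a list sort over a specified store list, plus extra points for the express-delivery attribute.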

{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "terms": {
                "delivery_attr": ["3", "4"]
              }
            },
            "boost": 100
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "store_id": "6640"
              }
            },
            "boost": 3000
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "store_id": "4830"
              }
            },
            "boost": 2000
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "store_id": "9999"
              }
            },
            "boost": 1000
          }
        }
      ]
    }
  }
}
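
Since a bool query sums the scores of its matching should clauses, the resulting score of a document can be worked out by hand. The sketch below does this for the clauses above; the sample documents are made up.

```python
# (field, accepted values, boost) for each constant_score clause in the query above.
CLAUSES = [
    ("delivery_attr", {"3", "4"}, 100),
    ("store_id", {"6640"}, 3000),
    ("store_id", {"4830"}, 2000),
    ("store_id", {"9999"}, 1000),
]


def should_score(doc, clauses=CLAUSES):
    """Sum the boosts of every clause the document matches."""
    return sum(boost for field, values, boost in clauses if doc.get(field) in values)


express_4830 = should_score({"store_id": "4830", "delivery_attr": "3"})
plain_4830 = should_score({"store_id": "4830", "delivery_attr": "1"})
plain_6640 = should_score({"store_id": "6640", "delivery_attr": "1"})
```

With these particular boosts the express-delivery bonus (100) is smaller than the gap between stores (1000), so it only reorders documents within the same store.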

For more complex (read: more awkward) sorts, such as order by brand_id, category_id, spu_id:

{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "term": {
                "brand_id": "VALUE"
              }
            },
            "boost": 100000
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "category_id": "VALUE"
              }
            },
            "boost": 10000
          }
        },
        {
          "constant_score": {
            "filter": {
              "term": {
                "spu_id": "VALUE"
              }
            },
            "boost": 1000
          }
        }
      ]
    }
  }
}

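As shown above, the key is to allocate the score ranges sensibly so that they do not overlap (the DSL that gets built is flat, and the hierarchy is expressed through the scores). Taking this one step further, a scoring DSL can be defined, for example: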

select spu_id, store_id order by brand_id:[], category_id:[], spu_id:[], sales
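
One possible way to allocate non-overlapping score ranges for such a multi-level sort is to treat the levels like digits of a mixed-radix number: each level's step is larger than the maximum total that all less significant levels can contribute. This is a sketch of that idea, not the allocation used in the examples above; the function name and sample values are hypothetical.

```python
def allocate_boosts(levels):
    """Assign boosts to (field, ordered_values) levels, most significant level first.

    Returns a list of (field, value, boost) tuples.
    """
    clauses = []
    step = 1  # boost step of the least significant level
    for field, values in reversed(levels):
        n = len(values)
        for rank, value in enumerate(values):
            clauses.append((field, value, (n - rank) * step))
        # The next, more significant level must outweigh everything below it:
        # the lower levels can contribute at most step * (n + 1) - 1 in total.
        step *= n + 1
    return clauses


clauses = allocate_boosts([
    ("brand_id", ["b1", "b2"]),
    ("category_id", ["c1", "c2", "c3"]),
])
boost_of = {(field, value): boost for field, value, boost in clauses}
```

Here the best category (boost 3) combined with the second brand (boost 4) still scores below the first brand alone (boost 8), so the brand order always dominates.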

Extension: Join in ES

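In the original design all data is partitioned by store, which suits a flattened model well: one document per spu_id + store_id (and product attributes rarely change). In the all-store sales scenario, however, once a product is sold in any one store, the current design requires updating as many documents as there are stores, which leaves the update volume uncontrolled. For all-store sales we therefore introduce a Join-like mechanism (only Join-like): Parent-Child. With this structure, the number of updates per sales change is bounded by the number of shards (a quantity that is known precisely):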

{
  "mappings": {
    "properties": {
      "join_field": {
        "type": "join",
        "relations": {
          "goods": "store"
        }
      }
    }
  }
}

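Advantages of this structure: product updates and store updates are independent of each other (parent and child are separate documents, which is also what distinguishes this from nested). Disadvantages: lower performance than the flattened design, and reindex has to be paired with a script.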
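
To illustrate what documents under this mapping might look like, the sketch below builds a bulk request body for one parent (goods) and one child (store). A child has to be indexed with the same routing value as its parent so that both land on the same shard. The index name, document ids, routing value and the storeId field are assumptions made for the example; spuId, sales and isEmptyStock are borrowed from the query shown later.

```python
import json

INDEX = "goods_index"  # hypothetical index name


def bulk_pair(doc_id, doc, routing):
    """Return the two NDJSON lines (action, source) for one bulk index operation."""
    action = {"index": {"_index": INDEX, "_id": doc_id, "routing": routing}}
    return [json.dumps(action), json.dumps(doc)]


def goods_doc(spu_id, sales):
    # Parent document: join_field holds just the relation name.
    return {"spuId": spu_id, "sales": sales, "join_field": "goods"}


def store_doc(spu_id, store_id, is_empty_stock, parent_id):
    # Child document: join_field names the relation and points at the parent _id.
    return {
        "spuId": spu_id,
        "storeId": store_id,
        "isEmptyStock": is_empty_stock,
        "join_field": {"name": "store", "parent": parent_id},
    }


parent_id = "goods-1227717"  # hypothetical id scheme
routing = "shard-key-0"      # assumed routing value shared by parent and child
lines = bulk_pair(parent_id, goods_doc(1227717, 42), routing)
lines += bulk_pair("store-6640-1227717", store_doc(1227717, "6640", False, parent_id), routing)
bulk_body = "\n".join(lines) + "\n"
```

A sales change then only touches the parent document, while stock changes only touch the child.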

ES DSL -- order by [ in stock, sold out ], [ product list ]:

{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": { "term": { "isEmptyStock": true } },
            "boost": 900
          }
        },
        {
          "constant_score": {
            "filter": { "term": { "isEmptyStock": false } },
            "boost": 1800
          }
        },
        {
          "has_parent": {
            "query": {
              "function_score": {
                "query": {
                  "bool": {
                    "should": [
                      {
                        "constant_score": {
                          "filter": { "term": { "spuId": 1227717 } },
                          "boost": 100
                        }
                      },
                      {
                        "constant_score": {
                          "filter": { "term": { "spuId": 1223522 } },
                          "boost": 200
                        }
                      },
                      {
                        "constant_score": {
                          "filter": { "term": { "spuId": 1216373 } },
                          "boost": 300
                        }
                      },
                      {
                        "constant_score": {
                          "filter": { "term": { "spuId": 1232955 } },
                          "boost": 400
                        }
                      },
                      {
                        "constant_score": {
                          "filter": { "term": { "spuId": 1215704 } },
                          "boost": 500
                        }
                      },
                      {
                        "constant_score": {
                          "filter": { "term": { "spuId": 1215076 } },
                          "boost": 600
                        }
                      },
                      {
                        "constant_score": {
                          "filter": { "term": { "spuId": 1215096 } },
                          "boost": 700
                        }
                      },
                      {
                        "constant_score": {
                          "filter": { "term": { "spuId": 1233808 } },
                          "boost": 800
                        }
                      }
                    ]
                  }
                },
                "functions": [
                  {
                    "field_value_factor": {
                      "field": "sales",
                      "modifier": "log2p"
                    }
                  }
                ],
                "boost_mode": "sum"
              }
            },
            "parent_type": "goods",
            "score": true
          }
        }
      ]
    }
  },
  "_source": { "includes": ["spuId"] },
  "sort": [
    { "_score": { "order": "desc" } },
    { "spuId": { "order": "asc" } }
  ]
}

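As the examples above show, the sorting DSL is only suitable for discrete values. The reason is straightforward: for continuous values there is no way to work out a suitable score range.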
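
The way the tiers in the query above stay separate can be checked with a rough model of the final score: the stock boost, plus the product-list boost, plus the log2p-smoothed sales inherited from the parent. This is an approximation that assumes has_parent passes the parent score through unchanged and that the should clauses simply add up; the sample numbers are made up.

```python
import math


def final_score(is_empty_stock, list_boost, sales):
    """Approximate score of a store document under the query above."""
    stock = 900 if is_empty_stock else 1800
    # boost_mode "sum": constant list boost plus log2p(sales) from the parent.
    parent = list_boost + math.log10(sales + 2)
    return stock + parent


# The weakest in-stock listing still beats the strongest sold-out listing.
in_stock_low = final_score(False, 100, 0)
sold_out_high = final_score(True, 800, 10**6)

# Within one stock tier, the list position outweighs even very large sales.
second_in_list = final_score(False, 200, 0)
first_with_sales = final_score(False, 100, 10**6)
```

The continuous sales value only fits because log2p squeezes it well below the step of 100 between list positions; an unsmoothed continuous value would spill into the neighbouring tier.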

Product sorting DSL -- conditional score-based sorting