如何用neo4j对50万掘金用户进行分析(一)

2,371 阅读12分钟

前言

在掘金混迹有一段时间了,发现掘友圈子,都是帅气、有才、技术好的大佬。为深入了解掘友们,通过获取部分掘友用户公开信息,利用图数据库neo4j和图挖掘算法分析掘一下友圈那些鲜为人知的事儿。 ^_^

数据获取

利用 pythonrequests 写一个简单的爬虫,以站长 阴明 为起点,利用对掘金用户的followerfollowee 关系进行深度和广度遍历抓取,共获取到50万用户(目前掘友总数据估计在200万以上),711万follow关系。对于没有follow关系的用户,暂不做分析。

follow数据样例:

{
    "objectId": "56887f661482e8a3f10d5061",
    "follower": {
        "objectId": "56887e3e60b2a099cdd3d8df",
        "username": "逗论",
        ...
    },
    "followeeId": "557f9654e4b0d02dc2d7125d",
    "createdAtString": "2016-01-03T01:54:46.919Z",
    "updatedAtString": "2016-01-03T01:54:46.947Z"
}

user数据样例:

{
    "objectId": "5946111561ff4b006ced8427",
    "followersCount": 0,
    "followeesCount": 1,
    "username": "abu1497764117405",
    ...
}

关于 neo4j

neo4j是当前知识图谱领域,一款非常优秀的图数据存储和分析挖掘的主流(数据库)工具,当前已经更新至4.0版本。通过neo4j能够实现上亿级别的实体和关系挖掘。本文以掘友用户User和关注follow关系网分析为主,构建的(实体A,关系,实体B)三元关系组相对简单,主要为(User,follow,User)。如下图

数据导入

neo4j 导入数据方法有很多,如LOAD CSV能够读取CSV文件、neo4j-import能够对上亿级别数据快速初始化导入、apoc-jdbc能够直连常见关系型数据库mysql\oarcle等获取数据。

为简便使用,本次采集的数据以json格式存放在文件中。导入方式选择读取文件,加载json string并解析。

启动neo4j shell:

$ ./bin/cypher-shell

User添加索引,提高后续查询效率

CREATE CONSTRAINT ON (n:User) ASSERT n.id IS UNIQUE;

Added 1 constraints
0 rows available after 205 ms, consumed after another 1 ms

导入用户User Node信息:

USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///data/basicinfo.csv' AS row FIELDTERMINATOR '\t'
WITH row
WITH apoc.convert.fromJsonMap( row[0]) AS info WHERE LENGTH(info["d"][info["id"]]) = 1
MERGE (u:User{id:info["id"]})
ON CREATE SET u += apoc.map.clean(info["d"][info["id"]], ['community','roles','cancelBakup'], [])
;
0 rows available after 99179 ms, consumed after another 0 ms
Added 500255 nodes, Set 18492458 properties, Added 500255 labels

共导入 500255 个用户信息,共耗时 99 秒,速度还不错,简单核查下结果。

neo4j> MATCH (n:User) RETURN COUNT(n);
+----------+
| COUNT(n) |
+----------+
| 500255   |
+----------+

1 row available after 10 ms, consumed after another 1 ms

导入用户follow relation信息:

USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///data/data.csv' AS row FIELDTERMINATOR '\t'
WITH row
WITH apoc.convert.fromJsonMap(apoc.text.join(row,",")) AS info
UNWIND info["d"] AS elem
WITH
info["type"] AS type,
CASE WHEN info["type"] = "Followee" THEN  elem["followerId"] WHEN info["type"] = "Follower" THEN  elem["follower"]["objectId"] END AS startId,
CASE WHEN info["type"] = "Followee" THEN elem["followee"]["objectId"] WHEN info["type"] = "Follower" THEN  elem["followeeId"] END AS endId ,
elem["createdAtString"] AS createdAtString,
elem["updatedAtString"] AS updatedAtString
MATCH (startNode:User{id:startId}),(endNode:User{id:endId})
MERGE(startNode)-[r:follow]->(endNode)
 ON CREATE SET r.createdAtString = createdAtString,
               r.updatedAtString = updatedAtString
;


0 rows available after 746892 ms, consumed after another 0 ms
Created 7115107 relationships, Set 14230214 properties

共导入 7115107 条用户的关注follow关系,共耗时 746 秒,核查下结果。

neo4j> MATCH p=()-[r:follow]->() RETURN COUNT(r);
+----------+
| COUNT(r) |
+----------+
| 7115107  |
+----------+

1 row available after 7 ms, consumed after another 0 ms

导入完成后,整个数据库文件 5.03 GB

数据初探

言归正传,在数据备好后,对掘友信息选取一些重要维度进行分析。

掘友用户增量情况


// 每月新增用户和关系情况
MATCH (u) WHERE EXISTS(u.createdAt)
WITH apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM') AS createdAt, COUNT(1) AS num
WITH apoc.map.fromLists(COLLECT(createdAt), COLLECT(num)) AS userMap
MATCH ()<-[r]-() 
WITH userMap,r.createdAtString AS createdAtString
WITH userMap,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM') AS createdAtString, COUNT(1) AS num
WITH createdAtString, num AS followNum  ,userMap[createdAtString] AS userNum
ORDER BY createdAtString
WITH COLLECT({createdAtString:createdAtString, userNum:userNum ,followNum:followNum} ) AS infoList
UNWIND RANGE(0,LENGTH(infoList)-1) AS index
WITH infoList[index]["createdAtString"] AS createdAtString,
infoList[index]["userNum"] AS userNum,
infoList[index]["followNum"] AS followNum,
apoc.coll.sum([elem IN infoList[..index+1] | elem["userNum"]]) AS userNumSum,
apoc.coll.sum([elem IN infoList[..index+1] | elem["followNum"]]) AS followNumSum
RETURN createdAtString,userNum,followNum,userNumSum,followNumSum,followNumSum/userNumSum AS followPerUser
ORDER BY createdAtString DESC
LIMIT 10
;
+----------------------------------------------------------------------------------------+
| createdAtString | userNum | followNum | userNumSum | followNumSum | followPerUser      |
+----------------------------------------------------------------------------------------+
| "2019-12"       | 583     | 3436      | 496639.0   | 7115107.0    | 14.326516846240429 |
| "2019-11"       | 10098   | 172500    | 496056.0   | 7111671.0    | 14.336427742029127 |
| "2019-10"       | 11957   | 193671    | 485958.0   | 6939171.0    | 14.279363648710383 |
| "2019-09"       | 11995   | 202601    | 474001.0   | 6745500.0    | 14.230982635057732 |
| "2019-08"       | 12585   | 210558    | 462006.0   | 6542899.0    | 14.16193512638364  |
| "2019-07"       | 14042   | 228781    | 449421.0   | 6332341.0    | 14.089998019674203 |
| "2019-06"       | 11522   | 196869    | 435379.0   | 6103560.0    | 14.018958195043858 |
| "2019-05"       | 12924   | 223838    | 423857.0   | 5906691.0    | 13.935574969860118 |
| "2019-04"       | 13592   | 233482    | 410933.0   | 5682853.0    | 13.829147330586737 |
| "2019-03"       | 15747   | 283769    | 397341.0   | 5449371.0    | 13.714595272071092 |
+----------------------------------------------------------------------------------------+

10 rows available after 55237 ms, consumed after another 0 ms

从分析结果来看,掘友用户增一直保持着很稳定的态势。同时,随着用户的增多,掘友之间的关系也相对增多,也体现了用户之间的交互更加密切,是用户用户活跃度的简介体现。

元老掘友分析

来看下掘友圈,那些跟随者掘金一起成长的最早的一批元老们。

// 粉丝数、最早注册等统计, 每日新增粉丝
MATCH (u:User)<-[r]-() WHERE EXISTS(u.createdAt) AND EXISTS(r.createdAtString)
WITH u,r.createdAtString AS createdAtString
WITH u,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM-dd') AS createdAtString, COUNT(1) AS num
WITH u, MAX(num ) AS maxNum
RETURN u.username,
apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM-dd') AS createdAt,
apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS days,
u.followersCount AS followersCount,
u.followersCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS aveNum,
maxNum 
ORDER BY createdAt 
// ORDER BY aveNum DESC
// ORDER BY maxNum DESC
LIMIT 5
;
  • 最早注册掘金的元老们:
+-----------------------------------------------------------------------+
| u.username   | createdAt    | days | followersCount | aveNum | maxNum |
+-----------------------------------------------------------------------+
| "Ming_Zhe"   | "2015-03-24" | 1719 | 20             | 0      | 3      |
| "代码家"      | "2015-03-24" | 1719 | 140            | 0      | 6      |
| "稀土君"      | "2015-04-02" | 1709 | 52904          | 30     | 308    |
| "江昪"        | "2015-04-02" | 1710 | 9110           | 5      | 68     |
| "阴明"        | "2015-04-02" | 1709 | 74262          | 43     | 286    |
| "fskslafhd"  | "2015-04-15" | 1697 | 8              | 0      | 2      |
| "荆全齐"      | "2015-04-16" | 1696 | 96             | 0      | 4      |
| "晓风well"    | "2015-04-16" | 1696 | 192            | 0      | 6      |
| "开源小组"    | "2015-04-16" | 1696 | 43             | 0      | 3      |
| "Millie_Lin" | "2015-04-16" | 1696 | 7744           | 4      | 67     |
+-----------------------------------------------------------------------+
  • 最吸粉的大佬们(平均每日新增粉丝),平均一天吸粉近60。

+--------------------------------------------------------------------------+
| u.username      | createdAt    | days | followersCount | aveNum | maxNum |
+--------------------------------------------------------------------------+
| "石杉的架构笔记"  | "2018-11-05" | 396  | 22877          | 57     | 189    |
| "刘小夕"         | "2019-02-12" | 298  | 17086          | 57     | 326    |
| "闲鱼技术"       | "2018-04-03" | 613  | 27634          | 45     | 157    |
| "ConardLi"      | "2018-11-13" | 389  | 17373          | 44     | 285    |
| "美团技术团队"    | "2018-03-30" | 617  | 27576          | 44     | 283    |
| "阴明"           | "2015-04-02" | 1709 | 74262          | 43     | 286    |
| "Java3y"        | "2018-01-30" | 676  | 29522          | 43     | 144    |
| "腾讯云加社区"    | "2017-02-24" | 1016 | 44415          | 43     | 297    |
| "小姐姐味道"      | "2018-10-30" | 403  | 17025          | 42     | 157    |
| "ikoala"        | "2019-06-01" | 188  | 7550           | 40     | 134    |
+--------------------------------------------------------------------------+

10 rows available after 55589 ms, consumed after another 0 ms

掘友粉丝分布

// 粉丝分布
MATCH (u:User) 
WHERE EXISTS(u.followersCount)
RETURN COUNT(u.followersCount) AS count,
AVG(u.followersCount) AS ave, 
percentileDisc(u.followersCount, 0.5) AS `50%`, 
percentileDisc(u.followersCount, 0.75) AS `75%`, 
percentileDisc(u.followersCount, 0.90) AS `90%`, 
percentileDisc(u.followersCount, 0.95) AS `95%`, 
percentileDisc(u.followersCount, 0.99) AS `99%`, 
percentileDisc(u.followersCount, 0.999) AS `99.9%`, 
percentileDisc(u.followersCount, 0.9999) AS `99.99%`, 
percentileDisc(u.followersCount, 0.99999) AS `99.999%`, 
percentileDisc(u.followersCount, 1) AS `100%`
;
+---------------------------------------------------------------------------------------------+
| count  | ave               | 50% | 75% | 90% | 95% | 99% | 99.9% | 99.99% | 99.999% | 100%  |
+---------------------------------------------------------------------------------------------+
| 499934 | 15.86762652670079 | 0   | 0   | 1   | 2   | 58  | 3721  | 17025  | 37657   | 74262 |
+---------------------------------------------------------------------------------------------+

1 row available after 4029 ms, consumed after another 0 ms

可以看到超过75%的掘友还没有粉丝额。要想排在掘友 top 1 内,也至少有58个粉丝。再看下 top1内的大佬吸粉情况:

// 粉丝占比
MATCH (u:User) 
WHERE EXISTS(u.followersCount) 
WITH 
SUM(CASE WHEN u.followersCount <=60 THEN u.followersCount ELSE 0 END) AS low,
SUM(CASE WHEN u.followersCount >60 THEN u.followersCount ELSE 0 END) AS high,
SUM(u.followersCount) AS followersCountSum
RETURN low,high,followersCountSum,
low*1.0/followersCountSum AS lowPercent,
high*1.0/followersCountSum AS highPercent
;
+---------------------------------------------------------------------------------+
| low    | high    | followersCountSum | lowPercent          | highPercent        |
+---------------------------------------------------------------------------------+
| 210930 | 7721836 | 7932766           | 0.02658971662595367 | 0.9734102833740463 |
+---------------------------------------------------------------------------------+

1 row available after 1399 ms, consumed after another 0 ms

果然,“97% 的粉丝都掌握在 1% 的 大佬手中”,掘友的粉丝贫富差距肉眼可见。

那些高产的大佬们

 // 发布文章最多的,发布沸点最多的
MATCH (u:User)  WHERE
// EXISTS(u.pinCount) 
EXISTS(u.postedPostsCount) 
RETURN u.id AS id,
u.username AS username,
u.pinCount AS pinCount,
u.pinCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd')  AS pinCountAve,
u.postedPostsCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd')  AS postedPostsCountAve,
u.totalCollectionsCount AS totalCollectionsCount,
u.postedPostsCount AS postedPostsCount,
u.totalViewsCount AS   totalViewsCount
ORDER BY postedPostsCount DESC 
LIMIT 10
;
  • 发布文章最多的大佬们:

+----------------------------------------------------------------------------+
| username        | postedPostsCountAve | postedPostsCount | totalViewsCount |
+----------------------------------------------------------------------------+
| "乌云知识库"      | 1                   | 1224             | 18030           |
| "刘旷"           | 1                   | 1066             | 8336            |
| "千锋JAVA开发"   | 1                   | 996              | 23818           |
| "阿里云云栖社区"  | 0                   | 932              | 152109          |
| "芊宝宝"         | 1                   | 922              | 19598           |
| "LLrZqSg3"      | 2                   | 913              | 506             |
| "go4it"         | 0                   | 780              | 153028          |
| "JerryWang_sap" | 1                   | 771              | 7178            |
| "腾讯云加社区"    | 0                   | 759              | 685628          |
| "云计算百科"     | 1                   | 754              | 6543            |
+----------------------------------------------------------------------------+

10 rows available after 1745 ms, consumed after another 0 ms
  • 发布沸点最多的大佬们
+--------------------------------------------------------------+
| username          | pinCount | pinCountAve | totalViewsCount |
+--------------------------------------------------------------+
| "网路冷眼"         | 19397    | 21          | NULL            |
| "湾区日报"         | 6836     | 10          | 0               |
| "神奇的命令行"      | 1989     | 1           | 108035          |
| "程序员趣事"        | 1583     | 2           | 0               |
| "树洞robot"        | 1530     | 4           | 0               |
| "HackerNews什么值  | 1510     | 4           | 0               |
| "爱可可-爱生活"     | 1484     | 4           | 58              |
| "科技新闻搬运工"     | 1476     | 2           | 0               |
| "娱乐小编"          | 1283     | 2           | 2761            |
| "王兴的饭否"        | 1264     | 1           | 0               |
+--------------------------------------------------------------+
10 rows available after 726 ms, consumed after another 1 ms

影响力分析挖掘

前面对掘友圈进行了粗略分析,粉丝数、文章数等都是单维度的,那么最有影响力掘友该如何分析,采用pageRank进行建模,如下

// pageRank 影响力挖掘分析
MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
WITH u0,algo.getNodeById(nodeId) AS u, score
WITH u,score, CASE WHEN EXISTS((u0)-[:follow]-(u)) THEN 1 ELSE 0 END AS hasRelation
RETURN 
u.username AS username,
score,hasRelation
ORDER BY score DESC 
LIMIT 20
;
       
+---------------------------------------------------+
| username       | score              | hasRelation |
+---------------------------------------------------+
| "阴明"          | 5492.921815348415  | 0           |
| "HollisChuang" | 3675.980443021097  | 1           |
| "漫话编程"       | 3543.827678073035  | 1           |
| "超人汪小建"     | 3378.3619996590537 | 1           |
| "稀土君"        | 2878.5440236374734 | 1           |
| "stormzhangV"  | 2855.733890181873  | 0           |
| "前端外刊评论"   | 2841.045798705519  | 1           |
| "liutao"       | 2481.8655670162293 | 1           |
| "膜法小编"       | 2047.3083251186647 | 1           |
| "LucasHC"      | 2030.568352384912  | 0           |
| "NeXT"         | 1922.9859172880645 | 1           |
| "李CHENGXI"     | 1752.8077864374964 | 1           |
| "水墨寒"        | 1528.3845290698114 | 1           |
| "清蒸不是水煮"   | 1471.3738727076911 | 1           |
| "蚂蚁金服数据.." | 1287.2166685772129 | 1           |
| "丁一"          | 1252.5015809553672 | 1           |
| "美团技术团队"   | 1250.5822049211245 | 1           |
| "江昪"          | 1234.908881070744  | 1           |
| "腾讯云加社区"    | 1169.060374834016  | 1           |
| "薄荷前端"       | 1093.3464047224304 | 0           |
+---------------------------------------------------+

20 rows available after 5093 ms, consumed after another 0 ms

从结果来看,站长 阴明 的影响力还是毋庸置疑的,遥遥领先其他掘友。且排在前面的用户,大部分都跟站长有直接的关(hasRelation=1)系。这忧郁的眼神,唏嘘的胡渣子。也不知道大家是喜欢站长,还是喜欢站长的女装。 ^_^

// pageRank 影响力前10 的用户之间的关系

CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS u, score
WITH u,score ORDER BY  score DESC  LIMIT 10
WITH COLLECT(u) AS Users
UNWIND Users AS u1
UNWIND Users AS u2
MATCH p=(u1)-[r:follow]-(u2)
RETURN p;

影响力前10名的的大佬之间,错综复杂,剪不断理还乱的关系,果然,大佬们都是在玩圈子啊。

互粉情况分析

话说,互粉是中华民族的传统美德,来看下,哪些大拿和粉丝互粉互动较多呢?

// 互粉的情况
MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)>id(u2)
MATCH (u1)-[:follow]->(u2)
RETURN COUNT(DISTINCT u1.id+u2.id) AS num
;
+-------+
| num   |
+-------+
| 22732 |
+-------+

1 row available after 32644 ms, consumed after another 0 ms

可以看出,掘友圈友超过22万互粉用户。

// 自我关注的情况
MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)=id(u2)
RETURN COUNT(DISTINCT u1.id) AS num
;
+-----+
| num |
+-----+
| 147 |
+-----+

1 row available after 0 ms, consumed after another 1351 ms

当然,也不乏自我互粉的掘友,你是这147位掘友中的一员吗。

// 互粉最多的用户
MATCH (u1:User)<-[:follow]-(u2:User)// WHERE id(u1)>id(u2)
MATCH (u1)-[:follow]->(u2)
WITH u1,u2
WITH u1,COUNT(u2) AS num
RETURN  u1.username,num
ORDER BY num DESC
LIMIT 20
;

+--------------------------+
| u1.username        | num |
+--------------------------+
| "liutao"           | 641 |
| "膜法小编"          | 449 |
| "阴明"              | 367 |
| "黑马UI"            | 322 |
| "pilishen"         | 315 |
| "拥抱心中的梦想"     | 296 |
| "powerzhuye"       | 260 |
| "zhennann"         | 237 |
| "Mockplus"         | 230 |
| "MarvinZhang"      | 184 |
| "清蒸不是水煮"       | 176 |
| "一颗香菜"          | 154 |
| "断天涯大虾"        | 152 |
| "爱原型爱设计"       | 151 |
| "闻人的技术博客"     | 141 |
| "bytedance"        | 139 |
| "Bug开发者"         | 136 |
| "江昪"             | 123 |
| "NervosNetwork"    | 118 |
+--------------------------+
20 rows available after 23 ms, consumed after another 35454 ms

以上大佬对粉丝很热情额,大家可以多多关注,说不定能得到大佬的互粉青睐。

本次初步分析至此,基于该数据的持续深入分析,见后续文章。

注: 分析代码见 github.com/saiwaiyanyu…

此次数据仅做分析学习使用。不分享,不开源,不用做任何商业用途。