前言
在掘金混迹有一段时间了,发现掘友圈子,都是帅气、有才、技术好的大佬。为深入了解掘友们,通过获取部分掘友用户公开信息,利用图数据库neo4j和图挖掘算法分析掘一下友圈那些鲜为人知的事儿。 ^_^
数据获取
利用 python的requests 写一个简单的爬虫,以站长 阴明 为起点,利用对掘金用户的follower和 followee 关系进行深度和广度遍历抓取,共获取到50万用户(目前掘友总数据估计在200万以上),711万的follow关系。对于没有follow关系的用户,暂不做分析。
follow数据样例:
{
"objectId": "56887f661482e8a3f10d5061",
"follower": {
"objectId": "56887e3e60b2a099cdd3d8df",
"username": "逗论",
...
},
"followeeId": "557f9654e4b0d02dc2d7125d",
"createdAtString": "2016-01-03T01:54:46.919Z",
"updatedAtString": "2016-01-03T01:54:46.947Z"
}
user数据样例:
{
"objectId": "5946111561ff4b006ced8427",
"followersCount": 0,
"followeesCount": 1,
"username": "abu1497764117405",
...
}
关于 neo4j
neo4j是当前知识图谱领域,一款非常优秀的图数据存储和分析挖掘的主流(数据库)工具,当前已经更新至4.0版本。通过neo4j能够实现上亿级别的实体和关系挖掘。本文以掘友用户User和关注follow关系网分析为主,构建的(实体A,关系,实体B)三元关系组相对简单,主要为(User,follow,User)。如下图

数据导入
neo4j 导入数据方法有很多,如LOAD CSV能够读取CSV文件、neo4j-import能够对上亿级别数据快速初始化导入、apoc-jdbc能够直连常见关系型数据库mysql\oarcle等获取数据。
为简便使用,本次采集的数据以json格式存放在文件中。导入方式选择读取文件,加载json string并解析。
启动neo4j shell:
$ ./bin/cypher-shell
给User添加索引,提高后续查询效率
CREATE CONSTRAINT ON (n:User) ASSERT n.id IS UNIQUE;
Added 1 constraints
0 rows available after 205 ms, consumed after another 1 ms
导入用户User Node信息:
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///data/basicinfo.csv' AS row FIELDTERMINATOR '\t'
WITH row
WITH apoc.convert.fromJsonMap( row[0]) AS info WHERE LENGTH(info["d"][info["id"]]) = 1
MERGE (u:User{id:info["id"]})
ON CREATE SET u += apoc.map.clean(info["d"][info["id"]], ['community','roles','cancelBakup'], [])
;
0 rows available after 99179 ms, consumed after another 0 ms
Added 500255 nodes, Set 18492458 properties, Added 500255 labels
共导入 500255 个用户信息,共耗时 99 秒,速度还不错,简单核查下结果。
neo4j> MATCH (n:User) RETURN COUNT(n);
+----------+
| COUNT(n) |
+----------+
| 500255 |
+----------+
1 row available after 10 ms, consumed after another 1 ms
导入用户follow relation信息:
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///data/data.csv' AS row FIELDTERMINATOR '\t'
WITH row
WITH apoc.convert.fromJsonMap(apoc.text.join(row,",")) AS info
UNWIND info["d"] AS elem
WITH
info["type"] AS type,
CASE WHEN info["type"] = "Followee" THEN elem["followerId"] WHEN info["type"] = "Follower" THEN elem["follower"]["objectId"] END AS startId,
CASE WHEN info["type"] = "Followee" THEN elem["followee"]["objectId"] WHEN info["type"] = "Follower" THEN elem["followeeId"] END AS endId ,
elem["createdAtString"] AS createdAtString,
elem["updatedAtString"] AS updatedAtString
MATCH (startNode:User{id:startId}),(endNode:User{id:endId})
MERGE(startNode)-[r:follow]->(endNode)
ON CREATE SET r.createdAtString = createdAtString,
r.updatedAtString = updatedAtString
;
0 rows available after 746892 ms, consumed after another 0 ms
Created 7115107 relationships, Set 14230214 properties
共导入 7115107 条用户的关注follow关系,共耗时 746 秒,核查下结果。
neo4j> MATCH p=()-[r:follow]->() RETURN COUNT(r);
+----------+
| COUNT(r) |
+----------+
| 7115107 |
+----------+
1 row available after 7 ms, consumed after another 0 ms
导入完成后,整个数据库文件 5.03 GB
数据初探
言归正传,在数据备好后,对掘友信息选取一些重要维度进行分析。
掘友用户增量情况
// 每月新增用户和关系情况
MATCH (u) WHERE EXISTS(u.createdAt)
WITH apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM') AS createdAt, COUNT(1) AS num
WITH apoc.map.fromLists(COLLECT(createdAt), COLLECT(num)) AS userMap
MATCH ()<-[r]-()
WITH userMap,r.createdAtString AS createdAtString
WITH userMap,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM') AS createdAtString, COUNT(1) AS num
WITH createdAtString, num AS followNum ,userMap[createdAtString] AS userNum
ORDER BY createdAtString
WITH COLLECT({createdAtString:createdAtString, userNum:userNum ,followNum:followNum} ) AS infoList
UNWIND RANGE(0,LENGTH(infoList)-1) AS index
WITH infoList[index]["createdAtString"] AS createdAtString,
infoList[index]["userNum"] AS userNum,
infoList[index]["followNum"] AS followNum,
apoc.coll.sum([elem IN infoList[..index+1] | elem["userNum"]]) AS userNumSum,
apoc.coll.sum([elem IN infoList[..index+1] | elem["followNum"]]) AS followNumSum
RETURN createdAtString,userNum,followNum,userNumSum,followNumSum,followNumSum/userNumSum AS followPerUser
ORDER BY createdAtString DESC
LIMIT 10
;
+----------------------------------------------------------------------------------------+
| createdAtString | userNum | followNum | userNumSum | followNumSum | followPerUser |
+----------------------------------------------------------------------------------------+
| "2019-12" | 583 | 3436 | 496639.0 | 7115107.0 | 14.326516846240429 |
| "2019-11" | 10098 | 172500 | 496056.0 | 7111671.0 | 14.336427742029127 |
| "2019-10" | 11957 | 193671 | 485958.0 | 6939171.0 | 14.279363648710383 |
| "2019-09" | 11995 | 202601 | 474001.0 | 6745500.0 | 14.230982635057732 |
| "2019-08" | 12585 | 210558 | 462006.0 | 6542899.0 | 14.16193512638364 |
| "2019-07" | 14042 | 228781 | 449421.0 | 6332341.0 | 14.089998019674203 |
| "2019-06" | 11522 | 196869 | 435379.0 | 6103560.0 | 14.018958195043858 |
| "2019-05" | 12924 | 223838 | 423857.0 | 5906691.0 | 13.935574969860118 |
| "2019-04" | 13592 | 233482 | 410933.0 | 5682853.0 | 13.829147330586737 |
| "2019-03" | 15747 | 283769 | 397341.0 | 5449371.0 | 13.714595272071092 |
+----------------------------------------------------------------------------------------+
10 rows available after 55237 ms, consumed after another 0 ms


从分析结果来看,掘友用户增一直保持着很稳定的态势。同时,随着用户的增多,掘友之间的关系也相对增多,也体现了用户之间的交互更加密切,是用户用户活跃度的简介体现。
元老掘友分析
来看下掘友圈,那些跟随者掘金一起成长的最早的一批元老们。
// 粉丝数、最早注册等统计, 每日新增粉丝
MATCH (u:User)<-[r]-() WHERE EXISTS(u.createdAt) AND EXISTS(r.createdAtString)
WITH u,r.createdAtString AS createdAtString
WITH u,apoc.date.format(apoc.date.fromISO8601(createdAtString),'ms','yyyy-MM-dd') AS createdAtString, COUNT(1) AS num
WITH u, MAX(num ) AS maxNum
RETURN u.username,
apoc.date.format(apoc.date.fromISO8601(u.createdAt),'ms','yyyy-MM-dd') AS createdAt,
apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS days,
u.followersCount AS followersCount,
u.followersCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS aveNum,
maxNum
ORDER BY createdAt
// ORDER BY aveNum DESC
// ORDER BY maxNum DESC
LIMIT 5
;
- 最早注册掘金的元老们:
+-----------------------------------------------------------------------+
| u.username | createdAt | days | followersCount | aveNum | maxNum |
+-----------------------------------------------------------------------+
| "Ming_Zhe" | "2015-03-24" | 1719 | 20 | 0 | 3 |
| "代码家" | "2015-03-24" | 1719 | 140 | 0 | 6 |
| "稀土君" | "2015-04-02" | 1709 | 52904 | 30 | 308 |
| "江昪" | "2015-04-02" | 1710 | 9110 | 5 | 68 |
| "阴明" | "2015-04-02" | 1709 | 74262 | 43 | 286 |
| "fskslafhd" | "2015-04-15" | 1697 | 8 | 0 | 2 |
| "荆全齐" | "2015-04-16" | 1696 | 96 | 0 | 4 |
| "晓风well" | "2015-04-16" | 1696 | 192 | 0 | 6 |
| "开源小组" | "2015-04-16" | 1696 | 43 | 0 | 3 |
| "Millie_Lin" | "2015-04-16" | 1696 | 7744 | 4 | 67 |
+-----------------------------------------------------------------------+
- 最吸粉的大佬们(平均每日新增粉丝),平均一天吸粉近60。
+--------------------------------------------------------------------------+
| u.username | createdAt | days | followersCount | aveNum | maxNum |
+--------------------------------------------------------------------------+
| "石杉的架构笔记" | "2018-11-05" | 396 | 22877 | 57 | 189 |
| "刘小夕" | "2019-02-12" | 298 | 17086 | 57 | 326 |
| "闲鱼技术" | "2018-04-03" | 613 | 27634 | 45 | 157 |
| "ConardLi" | "2018-11-13" | 389 | 17373 | 44 | 285 |
| "美团技术团队" | "2018-03-30" | 617 | 27576 | 44 | 283 |
| "阴明" | "2015-04-02" | 1709 | 74262 | 43 | 286 |
| "Java3y" | "2018-01-30" | 676 | 29522 | 43 | 144 |
| "腾讯云加社区" | "2017-02-24" | 1016 | 44415 | 43 | 297 |
| "小姐姐味道" | "2018-10-30" | 403 | 17025 | 42 | 157 |
| "ikoala" | "2019-06-01" | 188 | 7550 | 40 | 134 |
+--------------------------------------------------------------------------+
10 rows available after 55589 ms, consumed after another 0 ms
掘友粉丝分布
// 粉丝分布
MATCH (u:User)
WHERE EXISTS(u.followersCount)
RETURN COUNT(u.followersCount) AS count,
AVG(u.followersCount) AS ave,
percentileDisc(u.followersCount, 0.5) AS `50%`,
percentileDisc(u.followersCount, 0.75) AS `75%`,
percentileDisc(u.followersCount, 0.90) AS `90%`,
percentileDisc(u.followersCount, 0.95) AS `95%`,
percentileDisc(u.followersCount, 0.99) AS `99%`,
percentileDisc(u.followersCount, 0.999) AS `99.9%`,
percentileDisc(u.followersCount, 0.9999) AS `99.99%`,
percentileDisc(u.followersCount, 0.99999) AS `99.999%`,
percentileDisc(u.followersCount, 1) AS `100%`
;
+---------------------------------------------------------------------------------------------+
| count | ave | 50% | 75% | 90% | 95% | 99% | 99.9% | 99.99% | 99.999% | 100% |
+---------------------------------------------------------------------------------------------+
| 499934 | 15.86762652670079 | 0 | 0 | 1 | 2 | 58 | 3721 | 17025 | 37657 | 74262 |
+---------------------------------------------------------------------------------------------+
1 row available after 4029 ms, consumed after another 0 ms
可以看到超过75%的掘友还没有粉丝额。要想排在掘友 top 1 内,也至少有58个粉丝。再看下 top1内的大佬吸粉情况:
// 粉丝占比
MATCH (u:User)
WHERE EXISTS(u.followersCount)
WITH
SUM(CASE WHEN u.followersCount <=60 THEN u.followersCount ELSE 0 END) AS low,
SUM(CASE WHEN u.followersCount >60 THEN u.followersCount ELSE 0 END) AS high,
SUM(u.followersCount) AS followersCountSum
RETURN low,high,followersCountSum,
low*1.0/followersCountSum AS lowPercent,
high*1.0/followersCountSum AS highPercent
;
+---------------------------------------------------------------------------------+
| low | high | followersCountSum | lowPercent | highPercent |
+---------------------------------------------------------------------------------+
| 210930 | 7721836 | 7932766 | 0.02658971662595367 | 0.9734102833740463 |
+---------------------------------------------------------------------------------+
1 row available after 1399 ms, consumed after another 0 ms
果然,“97% 的粉丝都掌握在 1% 的 大佬手中”,掘友的粉丝贫富差距肉眼可见。
那些高产的大佬们
// 发布文章最多的,发布沸点最多的
MATCH (u:User) WHERE
// EXISTS(u.pinCount)
EXISTS(u.postedPostsCount)
RETURN u.id AS id,
u.username AS username,
u.pinCount AS pinCount,
u.pinCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS pinCountAve,
u.postedPostsCount/apoc.date.convert( timestamp()+3600*8*1000- apoc.date.fromISO8601(u.createdAt), 'ms', 'd') AS postedPostsCountAve,
u.totalCollectionsCount AS totalCollectionsCount,
u.postedPostsCount AS postedPostsCount,
u.totalViewsCount AS totalViewsCount
ORDER BY postedPostsCount DESC
LIMIT 10
;
- 发布文章最多的大佬们:
+----------------------------------------------------------------------------+
| username | postedPostsCountAve | postedPostsCount | totalViewsCount |
+----------------------------------------------------------------------------+
| "乌云知识库" | 1 | 1224 | 18030 |
| "刘旷" | 1 | 1066 | 8336 |
| "千锋JAVA开发" | 1 | 996 | 23818 |
| "阿里云云栖社区" | 0 | 932 | 152109 |
| "芊宝宝" | 1 | 922 | 19598 |
| "LLrZqSg3" | 2 | 913 | 506 |
| "go4it" | 0 | 780 | 153028 |
| "JerryWang_sap" | 1 | 771 | 7178 |
| "腾讯云加社区" | 0 | 759 | 685628 |
| "云计算百科" | 1 | 754 | 6543 |
+----------------------------------------------------------------------------+
10 rows available after 1745 ms, consumed after another 0 ms
- 发布沸点最多的大佬们
+--------------------------------------------------------------+
| username | pinCount | pinCountAve | totalViewsCount |
+--------------------------------------------------------------+
| "网路冷眼" | 19397 | 21 | NULL |
| "湾区日报" | 6836 | 10 | 0 |
| "神奇的命令行" | 1989 | 1 | 108035 |
| "程序员趣事" | 1583 | 2 | 0 |
| "树洞robot" | 1530 | 4 | 0 |
| "HackerNews什么值 | 1510 | 4 | 0 |
| "爱可可-爱生活" | 1484 | 4 | 58 |
| "科技新闻搬运工" | 1476 | 2 | 0 |
| "娱乐小编" | 1283 | 2 | 2761 |
| "王兴的饭否" | 1264 | 1 | 0 |
+--------------------------------------------------------------+
10 rows available after 726 ms, consumed after another 1 ms
影响力分析挖掘
前面对掘友圈进行了粗略分析,粉丝数、文章数等都是单维度的,那么最有影响力掘友该如何分析,采用pageRank进行建模,如下
// pageRank 影响力挖掘分析
MATCH(u0:User{id:'551d677ee4b0cd5b623f49cb'})
CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
WITH u0,algo.getNodeById(nodeId) AS u, score
WITH u,score, CASE WHEN EXISTS((u0)-[:follow]-(u)) THEN 1 ELSE 0 END AS hasRelation
RETURN
u.username AS username,
score,hasRelation
ORDER BY score DESC
LIMIT 20
;
+---------------------------------------------------+
| username | score | hasRelation |
+---------------------------------------------------+
| "阴明" | 5492.921815348415 | 0 |
| "HollisChuang" | 3675.980443021097 | 1 |
| "漫话编程" | 3543.827678073035 | 1 |
| "超人汪小建" | 3378.3619996590537 | 1 |
| "稀土君" | 2878.5440236374734 | 1 |
| "stormzhangV" | 2855.733890181873 | 0 |
| "前端外刊评论" | 2841.045798705519 | 1 |
| "liutao" | 2481.8655670162293 | 1 |
| "膜法小编" | 2047.3083251186647 | 1 |
| "LucasHC" | 2030.568352384912 | 0 |
| "NeXT" | 1922.9859172880645 | 1 |
| "李CHENGXI" | 1752.8077864374964 | 1 |
| "水墨寒" | 1528.3845290698114 | 1 |
| "清蒸不是水煮" | 1471.3738727076911 | 1 |
| "蚂蚁金服数据.." | 1287.2166685772129 | 1 |
| "丁一" | 1252.5015809553672 | 1 |
| "美团技术团队" | 1250.5822049211245 | 1 |
| "江昪" | 1234.908881070744 | 1 |
| "腾讯云加社区" | 1169.060374834016 | 1 |
| "薄荷前端" | 1093.3464047224304 | 0 |
+---------------------------------------------------+
20 rows available after 5093 ms, consumed after another 0 ms

从结果来看,站长 阴明 的影响力还是毋庸置疑的,遥遥领先其他掘友。且排在前面的用户,大部分都跟站长有直接的关(hasRelation=1)系。这忧郁的眼神,唏嘘的胡渣子。也不知道大家是喜欢站长,还是喜欢站长的女装。 ^_^
// pageRank 影响力前10 的用户之间的关系
CALL algo.pageRank.stream('User', 'follow', {iterations:20, dampingFactor:0.85}) YIELD nodeId, score
WITH algo.getNodeById(nodeId) AS u, score
WITH u,score ORDER BY score DESC LIMIT 10
WITH COLLECT(u) AS Users
UNWIND Users AS u1
UNWIND Users AS u2
MATCH p=(u1)-[r:follow]-(u2)
RETURN p;

影响力前10名的的大佬之间,错综复杂,剪不断理还乱的关系,果然,大佬们都是在玩圈子啊。
互粉情况分析
话说,互粉是中华民族的传统美德,来看下,哪些大拿和粉丝互粉互动较多呢?
// 互粉的情况
MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)>id(u2)
MATCH (u1)-[:follow]->(u2)
RETURN COUNT(DISTINCT u1.id+u2.id) AS num
;
+-------+
| num |
+-------+
| 22732 |
+-------+
1 row available after 32644 ms, consumed after another 0 ms
可以看出,掘友圈友超过22万互粉用户。
// 自我关注的情况
MATCH (u1:User)<-[:follow]-(u2:User) WHERE id(u1)=id(u2)
RETURN COUNT(DISTINCT u1.id) AS num
;
+-----+
| num |
+-----+
| 147 |
+-----+
1 row available after 0 ms, consumed after another 1351 ms
当然,也不乏自我互粉的掘友,你是这147位掘友中的一员吗。
// 互粉最多的用户
MATCH (u1:User)<-[:follow]-(u2:User)// WHERE id(u1)>id(u2)
MATCH (u1)-[:follow]->(u2)
WITH u1,u2
WITH u1,COUNT(u2) AS num
RETURN u1.username,num
ORDER BY num DESC
LIMIT 20
;
+--------------------------+
| u1.username | num |
+--------------------------+
| "liutao" | 641 |
| "膜法小编" | 449 |
| "阴明" | 367 |
| "黑马UI" | 322 |
| "pilishen" | 315 |
| "拥抱心中的梦想" | 296 |
| "powerzhuye" | 260 |
| "zhennann" | 237 |
| "Mockplus" | 230 |
| "MarvinZhang" | 184 |
| "清蒸不是水煮" | 176 |
| "一颗香菜" | 154 |
| "断天涯大虾" | 152 |
| "爱原型爱设计" | 151 |
| "闻人的技术博客" | 141 |
| "bytedance" | 139 |
| "Bug开发者" | 136 |
| "江昪" | 123 |
| "NervosNetwork" | 118 |
+--------------------------+
20 rows available after 23 ms, consumed after another 35454 ms
以上大佬对粉丝很热情额,大家可以多多关注,说不定能得到大佬的互粉青睐。
本次初步分析至此,基于该数据的持续深入分析,见后续文章。
注: 分析代码见 github.com/saiwaiyanyu…
此次数据仅做分析学习使用。不分享,不开源,不用做任何商业用途。