Hands-On Data Analysis (Part 1)

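Data Acquisition

Before any machine learning algorithm can solve a problem, the first thing to consider is the data: where does it come from?

  • For companies such as Kugou Music or NetEase Cloud Music, users' favorites and play history are directly available.
  • We take a shortcut: what data both carries users' musical interests and can actually be obtained from the outside?
    • Right: playlists.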

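Data Description

I crawled playlists from NetEase Cloud Music myself, covering about 800,000 songs and more than 4 million subscriptions. I won't go into how the data was collected; interested readers can look into web crawling (I used the Scrapy framework). The data is stored as JSON and is 3.59 GB in size. The format is described below:

Format of each playlist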

{
    "result": {
        "id": 111450065,
        "status": 0,
        "commentThreadId": "A_PL_0_111450065",
        "trackCount": 120,
        "updateTime": 1460164523907,
        "commentCount": 227,
        "ordered": true,
        "anonimous": false,
        "highQuality": false,
        "subscribers": [],
        "playCount": 687070,
        "trackNumberUpdateTime": 1460164523907,
        "createTime": 1443528317662,
        "name": "带本书去旅行吧,人生最美好的时光在路上。",
        "cloudTrackCount": 0,
        "shareCount": 149,
        "adType": 0,
        "trackUpdateTime": 1494134249465,
        "userId": 39256799,
        "coverImgId": 3359008023885470,
        "coverImgUrl": "http://p1.music.126.net/2ZFcuSJ6STR8WgzkIi2U-Q==/3359008023885470.jpg",
        "artists": null,
        "newImported": false,
        "subscribed": false,
        "privacy": 0,
        "specialType": 0,
        "description": "现在是一年中最美好的时节,世界上很多地方都不冷不热,有湛蓝的天空和清冽的空气,正是出游的好时光。长假将至,你是不是已经收拾行装准备出发了?行前焦虑症中把衣服、洗漱用品、充电器之类东西忙忙碌碌地丢进箱子,打进背包的时候,我打赌你肯定会留个位置给一位好朋友:书。不是吗?不管是打发时间,小读怡情,还是为了做好攻略备不时之需,亦或是为了小小地装上一把,你都得有一本书傍身呀。读大仲马,我是复仇的伯爵;读柯南道尔,我穿梭在雾都的暗夜;读村上春树,我是寻羊的冒险者;读马尔克斯,目睹百年家族兴衰;读三毛,让灵魂在撒哈拉流浪;读老舍,嗅着老北京的气息;读海茵莱茵,于科幻狂流遨游;读卡夫卡,在城堡中审判……读书的孩子不会孤单,读书的孩子永远幸福。",
        "subscribedCount": 10882,
        "totalDuration": 0,
        "tags": [
            "旅行",
            "钢琴",
            "安静"],
        "creator": {
            "followed": false,
            "remarkName": null,
            "expertTags": [
                "古典",
                "民谣",
                "华语"
            ],
            "userId": 39256799,
            "authority": 0,
            "userType": 0,
            "gender": 1,
            "backgroundImgId": 3427177752524551,
            "city": 360600,
            "mutual": false,
            "avatarUrl": "http://p1.music.126.net/TLRTrJpOM5lr68qJv1IyGQ==/1400777825738419.jpg",
            "avatarImgIdStr": "1400777825738419",
            "detailDescription": "",
            "province": 360000,
            "description": "",
            "birthday": 637516800000,
            "nickname": "有梦人生不觉寒",
            "vipType": 0,
            "avatarImgId": 1400777825738419,
            "defaultAvatar": false,
            "djStatus": 0,
            "accountStatus": 0,
            "backgroundImgIdStr": "3427177752524551",
            "backgroundUrl": "http://p1.music.126.net/LS96S_6VP9Hm7-T447-X0g==/3427177752524551.jpg",
            "signature": "漫无目的的乱听,听着,听着,竟然灵魂出窍了。更多精品音乐美图分享请加我微信hu272367751。微信是我的精神家园,有我最真诚的分享。",
            "authStatus": 0},
        "tracks": [{song 1}, {song 2}, ...]
    }
}

Format of each song

{
	"id": 29738501,
	"name": "跟着你到天边 钢琴版",
	"duration": 174001,
	"hearTime": 0,
	"commentThreadId": "R_SO_4_29738501",
	"score": 40,
	"mvid": 0,
	"hMusic": null,
	"disc": "",
	"fee": 0,
	"no": 1,
	"rtUrl": null,
	"ringtone": null,
	"rtUrls": [],
	"rurl": null,
	"status": 0,
	"ftype": 0,
	"mp3Url": "http://m2.music.126.net/vrVa20wHs8iIe0G8Oe7I9Q==/3222668581877701.mp3",
	"audition": null,
	"playedNum": 0,
	"copyrightId": 0,
	"rtype": 0,
	"crbt": null,
	"popularity": 40,
	"dayPlays": 0,
	"alias": [],
	"copyFrom": "",
	"position": 1,
	"starred": false,
	"starredNum": 0,
	"bMusic": {
	    "name": "跟着你到天边 钢琴版",
	    "extension": "mp3",
	    "volumeDelta": 0.0553125,
	    "sr": 44100,
	    "dfsId": 3222668581877701,
	    "playTime": 174001,
	    "bitrate": 96000,
	    "id": 52423394,
	    "size": 2089713
	},
	"lMusic": {
	    "name": "跟着你到天边 钢琴版",
	    "extension": "mp3",
	    "volumeDelta": 0.0553125,
	    "sr": 44100,
	    "dfsId": 3222668581877701,
	    "playTime": 174001,
	    "bitrate": 96000,
	    "id": 52423394,
	    "size": 2089713
	},
	"mMusic": {
	    "name": "跟着你到天边 钢琴版",
	    "extension": "mp3",
	    "volumeDelta": -0.000265076,
	    "sr": 44100,
	    "dfsId": 3222668581877702,
	    "playTime": 174001,
	    "bitrate": 128000,
	    "id": 52423395,
	    "size": 2785510
	},
	"artists": [
	    {
		"img1v1Url": "http://p1.music.126.net/6y-UleORITEDbvrOLV0Q8A==/5639395138885805.jpg",
		"name": "群星",
		"briefDesc": "",
		"albumSize": 0,
		"img1v1Id": 0,
		"musicSize": 0,
		"alias": [],
		"picId": 0,
		"picUrl": "http://p1.music.126.net/6y-UleORITEDbvrOLV0Q8A==/5639395138885805.jpg",
		"trans": "",
		"id": 122455
	    }
	],
	"album": {
	    "id": 3054006,
	    "status": 2,
	    "type": null,
	    "tags": "",
	    "size": 69,
	    "blurPicUrl": "http://p1.music.126.net/2XLMVZhzVZCOunaRCOQ7Bg==/3274345629219531.jpg",
	    "copyrightId": 0,
	    "name": "热门华语248",
	    "companyId": 0,
	    "songs": [],
	    "description": "",
	    "pic": 3274345629219531,
	    "commentThreadId": "R_AL_3_3054006",
	    "publishTime": 1388505600004,
	    "briefDesc": "",
	    "company": "",
	    "picId": 3274345629219531,
	    "alias": [],
	    "picUrl": "http://p1.music.126.net/2XLMVZhzVZCOunaRCOQ7Bg==/3274345629219531.jpg",
	    "artists": [
		{
		    "img1v1Url": "http://p1.music.126.net/6y-UleORITEDbvrOLV0Q8A==/5639395138885805.jpg",
		    "name": "群星",
		    "briefDesc": "",
		    "albumSize": 0,
		    "img1v1Id": 0,
		    "musicSize": 0,
		    "alias": [],
		    "picId": 0,
		    "picUrl": "http://p1.music.126.net/6y-UleORITEDbvrOLV0Q8A==/5639395138885805.jpg",
		    "trans": "",
		    "id": 122455
		}
	    ],
	    "artist": {
		"img1v1Url": "http://p1.music.126.net/6y-UleORITEDbvrOLV0Q8A==/5639395138885805.jpg",
		"name": "",
		"briefDesc": "",
		"albumSize": 0,
		"img1v1Id": 0,
		"musicSize": 0,
		"alias": [],
		"picId": 0,
		"picUrl": "http://p1.music.126.net/6y-UleORITEDbvrOLV0Q8A==/5639395138885805.jpg",
		"trans": "",
		"id": 0
	    }
	}
}
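
To get a feel for these records without loading the whole 3.59 GB file into memory, the file can be streamed one line at a time. The following is a minimal sketch, assuming one playlist JSON object per line (which is how the parsing script in the next section reads the file). `iter_playlists` and `summarize` are hypothetical helper names, and the in-memory sample stands in for the real file.

```python
import io
import json

def iter_playlists(lines):
    """Yield the "result" object of every well-formed JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)["result"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Skip lines that are not valid JSON or lack a "result" key.
            continue

def summarize(playlist):
    """Keep only the fields used later in this article."""
    return {
        "id": playlist["id"],
        "name": playlist["name"],
        "tags": playlist.get("tags", []),
        "subscribedCount": playlist.get("subscribedCount", 0),
        "n_tracks": len(playlist.get("tracks", [])),
    }

# Tiny in-memory stand-in for the crawled file (illustrative values only).
record = {"result": {"id": 1, "name": "demo", "tags": ["旅行"],
                     "subscribedCount": 150,
                     "tracks": [{"id": 10}, {"id": 11}]}}
sample = io.StringIO(json.dumps(record) + "\n" + "not json\n")

summaries = [summarize(p) for p in iter_playlists(sample)]
print(summaries)
```

For the real data, pass an open file handle (for example `open(file_name, encoding="utf-8")`) instead of the `StringIO` object.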

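Data Parsing

The reason for sharing the raw data and this format description is that the data contains a great deal of information (genre, artist, play count, song duration, release time, ...). After some thought you will certainly find ways to use it to further improve a recommender system.
Here we still use only the most basic music information. We assume that songs in the same playlist are fairly similar to one another, and that the user who created the playlist likes all of them.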
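
The assumption that songs sharing a playlist are similar can be made concrete by counting how often two songs appear together. Below is a minimal sketch on made-up playlists (the ids are illustrative, not taken from the dataset). `cooccurrence_counts` is a hypothetical helper, and this is only one simple way to express the idea, not necessarily the method used later in this series.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(playlists):
    """Count, for every pair of songs, how many playlists contain both."""
    counts = Counter()
    for songs in playlists.values():
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(songs)), 2):
            counts[(a, b)] += 1
    return counts

# Made-up playlists: playlist id -> list of song ids.
playlists = {
    "p1": ["s1", "s2", "s3"],
    "p2": ["s2", "s3"],
    "p3": ["s1", "s3"],
}

counts = cooccurrence_counts(playlists)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs with higher counts are the ones this assumption treats as more similar.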

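Raw data => playlist data

Extract four playlist-level fields: playlist name, playlist id, subscriber count, and categories (tags).
Extract four song-level fields: song id, song name, artist, and song popularity.

The result is organized into the following format: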

漫步西欧小镇上##小语种,旅行##69413685##474	18682332:::Wäg vo dir:::Joy Amelie:::70.0	4335372:::Only When I Sleep:::The Corrs:::60.0	2925502:::Si Seulement:::Lynnsha:::100.0	21014930:::Tu N'As Pas Cherché...:::La Grande Sophie:::100.0	20932638:::Du behöver aldrig mer vara rädd:::Lasse Lindh:::25.0	17100518:::Silent Machine:::Cat Power:::60.0	3308096:::Kor pai kon diew : ชอไปคนเดียว:::Palmy:::5.0	1648250:::les choristes:::Petits Chanteurs De Saint Marc:::100.0	4376212:::Paddy's Green Shamrock Shore:::The High Kings:::25.0	2925400:::A Todo Color:::Las Escarlatinas:::95.0	19711402:::Comme Toi:::Vox Angeli:::75.0	3977526:::Stay:::Blue Cafe:::100.0	2538518:::Shake:::Elize:::85.0	2866799:::Mon Ange:::Jena Lee:::85.0	5191949:::Je M'appelle Helene:::Hélène Rolles:::85.0	20036323:::Ich Lieb' Dich Immer Noch So Sehr:::Kate & Ben:::100.0

#coding: utf-8
import json

# Directory holding the crawled data (not defined in the original; adjust as needed).
path = "."

def parse_song_line(in_line):
    """Convert one raw playlist JSON line into the flat playlist format."""
    data = json.loads(in_line)
    name = data['result']['name']
    tags = ",".join(data['result']['tags'])
    subscribed_count = data['result']['subscribedCount']
    # Skip playlists with fewer than 100 subscribers.
    if subscribed_count < 100:
        return False
    playlist_id = data['result']['id']
    song_info = ''
    songs = data['result']['tracks']
    for song in songs:
        try:
            # One tab-separated entry per song: id:::name:::artist:::popularity
            song_info += "\t" + ":::".join([str(song['id']), song['name'], song['artists'][0]['name'], str(song['popularity'])])
        except Exception:
            # Skip songs with missing fields.
            continue
    return name + "##" + tags + "##" + str(playlist_id) + "##" + str(subscribed_count) + song_info

def parse_file(in_file, out_file):
    with open(in_file, encoding='utf-8') as fin, open(out_file, 'w', encoding='utf-8') as out:
        for line in fin:
            result = parse_song_line(line)
            if result:
                out.write(result.strip() + "\n")

parse_file(path + "/playlistdetail.all.json", "./163_music_playlist.txt")
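
To sanity-check the output, a line in the flat format can be parsed back into its parts. The following is a minimal sketch, assuming the `:::` field separator that `parse_song_line` writes. `parse_flat_line` is a hypothetical helper, and the sample line is a shortened version of the example above.

```python
def parse_flat_line(line, field_sep=":::"):
    """Split one flat playlist line into playlist metadata and songs."""
    head, *songs = line.rstrip("\n").split("\t")
    # Raises ValueError if the header does not have exactly four "##" fields.
    name, tags, playlist_id, subscribed_count = head.split("##")
    parsed_songs = []
    for song in songs:
        parts = song.split(field_sep)
        if len(parts) != 4:
            # Skip entries that do not have exactly four fields.
            continue
        song_id, song_name, artist, popularity = parts
        parsed_songs.append({
            "id": song_id,
            "name": song_name,
            "artist": artist,
            "popularity": float(popularity),
        })
    return {
        "name": name,
        "tags": tags.split(",") if tags else [],
        "id": playlist_id,
        "subscribed_count": int(subscribed_count),
        "songs": parsed_songs,
    }

line = ("漫步西欧小镇上##小语种,旅行##69413685##474"
        "\t18682332:::Wäg vo dir:::Joy Amelie:::70.0"
        "\t4335372:::Only When I Sleep:::The Corrs:::60.0")
playlist = parse_flat_line(line)
print(playlist["id"], playlist["tags"], len(playlist["songs"]))
```

If the file at hand uses a different separator, pass it through the `field_sep` argument.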

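Playlist data => recommender-system format

The most basic data format supported by mainstream Python recommender-system frameworks is the MovieLens dataset, whose rating records have the form user item rating timestamp. For simplicity, we convert our data into this format as well, treating each playlist as a user and each song as an item.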

#coding: utf-8

# Directory holding the playlist files (not defined in the original; adjust as needed).
path = "."

def is_valid_record(s):
    # A complete record has four fields: playlist_id, song_id, rating, timestamp.
    # A failed song parse leaves only "playlist_id," and is filtered out here.
    return len(s.split(",")) == 4

def parse_song_info(song_info):
    try:
        # Song fields are joined with ":::" by the previous step.
        song_id, name, artist, popularity = song_info.split(":::")
        # Every song in a playlist gets a fixed rating of 1.0 and a placeholder timestamp.
        return ",".join([song_id, "1.0", '1300000'])
    except Exception as e:
        print('song error:', e)
        return ""

def parse_playlist_line(in_line):
    try:
        contents = in_line.strip().split('\t')
        name, tags, playlist_id, subscribed_count = contents[0].split("##")
        songs_info = map(lambda x: playlist_id + "," + parse_song_info(x), contents[1:])
        songs_info = filter(is_valid_record, songs_info)
        return "\n".join(songs_info)
    except Exception as e:
        print('play:', e)
        return False


def parse_file(in_file, out_file):
    with open(in_file, encoding='utf-8') as fin, open(out_file, 'w', encoding='utf-8') as out:
        for line in fin:
            result = parse_playlist_line(line)
            if result:
                out.write(result.strip() + "\n")

parse_file(path + "/163_music_playlist-1.txt", "./163_music_suprise_format.txt")

parse_file(path + "/popular.playlist", "./popular_music_suprise_format.txt")
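
Once written, the rating file can be read back with the standard `csv` module, for example to check how many songs each playlist contributes. This is a minimal sketch, assuming comma-separated rows of playlist id, song id, rating, and timestamp as produced by the script above. `load_ratings` is a hypothetical helper, and the rows are sample values.

```python
import csv
import io
from collections import defaultdict

def load_ratings(lines):
    """Group (song_id, rating) pairs by playlist id."""
    by_playlist = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) != 4:
            # Ignore incomplete rows.
            continue
        playlist_id, song_id, rating, _timestamp = row
        by_playlist[playlist_id].append((song_id, float(rating)))
    return dict(by_playlist)

# In-memory stand-in for the generated rating file.
sample = io.StringIO(
    "69413685,18682332,1.0,1300000\n"
    "69413685,4335372,1.0,1300000\n"
    "broken,row\n"
)
ratings = load_ratings(sample)
for playlist_id, songs in ratings.items():
    print(playlist_id, len(songs))
```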

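Save playlist and song information for later use

We need to save the playlist id => playlist name and song id => song name mappings for later use.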

#coding: utf-8
import pickle

# Directory holding the playlist files (not defined in the original; adjust as needed).
path = "."

def parse_playlist_get_info(in_line, playlist_dic, song_dic):
    # Fields are tab-separated; the first field holds the playlist metadata.
    contents = in_line.strip().split('\t')
    try:
        name, tags, playlist_id, subscribed_count = contents[0].split("##")
    except ValueError:
        # Skip lines whose playlist header is malformed.
        return
    playlist_dic[playlist_id] = name
    for song in contents[1:]:
        try:
            song_id, song_name, artist, popularity = song.split(":::")
            song_dic[song_id] = song_name + "\t" + artist
        except ValueError:
            # Skip songs that do not have exactly four fields.
            pass

def parse_file(in_file, out_playlist, out_song):
    # Mapping from playlist id to playlist name
    playlist_dic = {}
    # Mapping from song id to "song name<TAB>artist"
    song_dic = {}
    with open(in_file, encoding='utf-8') as fin:
        for line in fin:
            parse_playlist_get_info(line, playlist_dic, song_dic)
    # Save the mappings to binary files
    with open(out_playlist, "wb") as f:
        pickle.dump(playlist_dic, f)
    # Reload later with: playlist_dic = pickle.load(open("playlist.pkl", "rb"))
    with open(out_song, "wb") as f:
        pickle.dump(song_dic, f)

parse_file(path + "/163_music_playlist.txt", "playlist.pkl", "song.pkl")

parse_file(path + "/popular.playlist", "popular_playlist.pkl", "popular_song.pkl")
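
When the mappings are needed again, they can be loaded and used to turn ids back into readable names. The following is a minimal sketch of that round trip using a temporary directory and two sample entries. `save_mapping` and `load_mapping` are hypothetical helpers, and the song value follows the "song name, tab, artist" layout stored above.

```python
import os
import pickle
import tempfile

def save_mapping(mapping, file_path):
    """Write a dictionary to a pickle file."""
    with open(file_path, "wb") as f:
        pickle.dump(mapping, f)

def load_mapping(file_path):
    """Read a dictionary back from a pickle file."""
    with open(file_path, "rb") as f:
        return pickle.load(f)

# Sample entries in the same shape as playlist_dic and song_dic.
playlist_dic = {"69413685": "漫步西欧小镇上"}
song_dic = {"18682332": "Wäg vo dir\tJoy Amelie"}

with tempfile.TemporaryDirectory() as tmp_dir:
    playlist_file = os.path.join(tmp_dir, "playlist.pkl")
    song_file = os.path.join(tmp_dir, "song.pkl")
    save_mapping(playlist_dic, playlist_file)
    save_mapping(song_dic, song_file)
    loaded_playlists = load_mapping(playlist_file)
    loaded_songs = load_mapping(song_file)

# Split the stored value back into song name and artist.
song_name, artist = loaded_songs["18682332"].split("\t")
print(loaded_playlists["69413685"], "|", song_name, "|", artist)
```

Pickle files can execute arbitrary code when loaded, so only load files you created yourself.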