Crawling all singers and singer IDs from xx Music


==================== Update of 6.25 ====================

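I crawled about 280,000 singer records, and then used them to crawl album data.

About 235,000 of those records are European and American singers.

That is far too many, so I skip the European and American singers when querying albums.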



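==================== Update of 6.22 ====================

Changes:

While crawling, I found that the singer page shows a loading screen and the singer information only appears after loading finishes. I did not know how to handle that delay, so scraping the page failed... In the end I had to go back to capturing the HTTP requests.

Analyzing the captured request gave the following: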

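/**
 * Builds the singer-list request URL for one sex/area combination and page.
 * The data parameter is the percent-encoded JSON request body; only area, sex,
 * sin (offset of the first record) and cur_page change between requests.
 */
public static String getUrl(SingerSearchConstants.Sex sex, SingerSearchConstants.Area area, int pageNum) {

    String data = "data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A"
            + area.getArea()
            + "%2C%22sex%22%3A"
            + sex.getSex()
            + "%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A" + PAGE_SIZE * (pageNum - 1) + "%2C%22cur_page%22%3A" + pageNum + "%7D%7D%7D";

    // singerSearchUrl is expected to hold the rest of the captured URL, up to and including the final '&'
    return SingerSearchConstants.singerSearchUrl + data;
}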
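The hand-written percent-encoded string above is hard to read. Here is a minimal sketch of the same request built from readable JSON and encoded with `URLEncoder`. The class name `SingerListQuery` is made up for illustration, and `PAGE_SIZE = 80` is only inferred from the captured request (`sin=160` with `cur_page=3`), so treat both as assumptions. It uses the `Charset` overload of `URLEncoder.encode`, which needs Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SingerListQuery {
    // Assumption: inferred from the captured request, where sin=160 for cur_page=3.
    static final int PAGE_SIZE = 80;

    /** Builds the readable JSON payload for one page of the singer list. */
    static String buildJson(int area, int sex, int pageNum) {
        int sin = PAGE_SIZE * (pageNum - 1); // offset of the first singer on this page
        return "{\"comm\":{\"ct\":24,\"cv\":10000},"
                + "\"singerList\":{\"module\":\"Music.SingerListServer\","
                + "\"method\":\"get_singer_list\","
                + "\"param\":{\"area\":" + area
                + ",\"sex\":" + sex
                + ",\"genre\":-100,\"index\":-100,\"sin\":" + sin
                + ",\"cur_page\":" + pageNum + "}}}";
    }

    /** Percent-encodes the JSON so it can be appended as the data parameter. */
    static String buildDataParam(int area, int sex, int pageNum) {
        return "data=" + URLEncoder.encode(buildJson(area, sex, pageNum), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Mainland (200), male (0), first page
        System.out.println(buildDataParam(200, 0, 1));
    }
}
```

`URLEncoder` does form encoding rather than strict URL encoding (a space would become `+`), but this payload has no spaces, so the result should match the captured string character for character.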

The sex values are:

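public enum Sex {
    MALE(0, "男"),     // male
    FEMALE(1, "女"),   // female
    BAND(2, "组合");   // group

    private final int sex;       // value sent in the request
    private final String label;  // display name
    Sex(int sex, String label) { this.sex = sex; this.label = label; }
    public int getSex() { return sex; }
}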

The area values are:

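public enum Area {
    NEIDI(200, "内地"),   // mainland China
    GANGTAI(2, "港台"),   // Hong Kong and Taiwan
    OUMEI(5, "欧美"),     // Europe and America
    RIBEN(4, "日本"),     // Japan
    HANGUO(3, "韩国"),    // Korea
    QITA(6, "其他");      // other
    // ALL(-100) is left out: -100 is the value in the captured request with no area filter

    private final int area;      // value sent in the request
    private final String label;  // display name

    Area(int area, String label) { this.area = area; this.label = label; }
    public int getArea() { return area; }
}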
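To cover every singer, each sex has to be combined with each area and then paged through. The sketch below only lists the starting combinations. `SingerCrawlPlan` and its trimmed-down enums are made up for illustration, and how many pages each combination has is not something I can state here, so a real crawl would need its own stop condition (for example, stopping when a page comes back empty, which is an assumption).

```java
import java.util.ArrayList;
import java.util.List;

class SingerCrawlPlan {
    // Trimmed-down copies of the enums above, carrying only the request value.
    enum Sex {
        MALE(0), FEMALE(1), BAND(2);
        final int code;
        Sex(int code) { this.code = code; }
    }

    enum Area {
        NEIDI(200), GANGTAI(2), OUMEI(5), RIBEN(4), HANGUO(3), QITA(6);
        final int code;
        Area(int code) { this.code = code; }
    }

    /** Lists the first-page parameters for every area/sex combination. */
    static List<String> firstPages() {
        List<String> pages = new ArrayList<>();
        for (Area area : Area.values()) {
            for (Sex sex : Sex.values()) {
                pages.add("area=" + area.code + ",sex=" + sex.code + ",cur_page=1");
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // 6 areas x 3 sexes = 18 starting points for the crawl
        firstPages().forEach(System.out::println);
    }
}
```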

With that, all the singer information can be fetched. I then handed the job off to the backend to run.

===================== Original post =====================


1. My first idea was to get the singer information by capturing the HTTP requests, but then I looked at this request:

https://u.y.qq.com/cgi-bin/musicu.fcg?callback=getUCGI0300&g_tk=5381&jsonpCallback=getUCGI0300&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A160%2C%22cur_page%22%3A3%7D%7D%7D

The last field looked like it was generated by some kind of algorithm:

data:

%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A160%2C%22cur_page%22%3A3%7D%7D%7D
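Looking back from the 6.22 update, this field seems to be nothing more than percent-encoded JSON. A minimal sketch to decode it and see what is inside (the class name `DataFieldDecoder` is made up for illustration; the `Charset` overload of `URLDecoder.decode` needs Java 10 or later):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

class DataFieldDecoder {
    /** Turns the percent-encoded data field back into readable text. */
    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The data field copied from the captured request
        String data = "%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C"
                + "%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C"
                + "%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B"
                + "%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C"
                + "%22index%22%3A-100%2C%22sin%22%3A160%2C%22cur_page%22%3A3%7D%7D%7D";
        System.out.println(decode(data));
    }
}
```

Once decoded, the only parts that look like they need to change between requests are `area`, `sex`, `sin` and `cur_page`.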


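That seemed too complicated, so I decided it would be easier to scrape the singer information straight from the web page.


Time to start crawling!

But I don't know how to write a crawler in Java yet and haven't found a suitable framework, so I'm off to look for one first...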