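====================Update of 6.25==================
Crawled 280k singer records, then used those records to crawl album data.
235k of them are Western (Europe/America) singers.
That is far too many, so Western singers are skipped when querying albums, which leaves roughly 45k singers.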
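Skipping the Western singers before the album crawl could look roughly like the sketch below. The `Singer` class and its field names are made up for illustration, and it assumes each stored singer record keeps the numeric area code it was crawled with (5 for OUMEI in the Area enum from the 6.22 update).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AlbumCrawlFilter {
    // Assumption: area code 5 means Europe/America, matching OUMEI(5, ...) in the Area enum.
    static final int AREA_OUMEI = 5;

    /** Hypothetical minimal singer record: only the fields this sketch needs. */
    static class Singer {
        final String name;
        final int area;

        Singer(String name, int area) {
            this.name = name;
            this.area = area;
        }
    }

    /** Returns the singers whose albums should still be crawled, in their original order. */
    static List<Singer> withoutWestern(List<Singer> singers) {
        List<Singer> kept = new ArrayList<>();
        for (Singer s : singers) {
            if (s.area != AREA_OUMEI) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder names, not real crawl results.
        List<Singer> all = Arrays.asList(
                new Singer("singer-a", 200),
                new Singer("singer-b", 5),
                new Singer("singer-c", 3));
        for (Singer s : withoutWestern(all)) {
            System.out.println(s.name);
        }
    }
}
```

If the singers sit in a database, the same filter is probably cheaper as a condition in the query that feeds the album crawler.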
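====================Update of 6.22==================
Change:
While crawling I found that the singer page shows a loading screen, and the singer info only appears after loading finishes. I couldn't figure out how to handle that delay, so that approach flopped... In the end I had to go back to packet capture.
The result of analyzing the captured request: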
// Builds the singer-list request URL for one sex/area filter and a 1-based page number.
public static String getUrl(SingerSearchConstants.Sex sex, SingerSearchConstants.Area area, int pageNum) {
    // Percent-encoded JSON payload; area, sex, sin (record offset) and cur_page are filled in.
    String data = "data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A"
            + area.getArea()
            + "%2C%22sex%22%3A"
            + sex.getSex()
            + "%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A" + PAGE_SIZE * (pageNum - 1)
            + "%2C%22cur_page%22%3A" + pageNum + "%7D%7D%7D";
    return SingerSearchConstants.singerSearchUrl + data;
}
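The hand-written percent-encoded string is easy to get wrong, so here is a sketch of an alternative: build the JSON in plain form and let `java.net.URLEncoder` do the encoding. This is not the code I actually ran. The class name is made up, and `PAGE_SIZE = 80` is an assumption inferred from the captured request, where `sin` is 160 on `cur_page` 3.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SingerListData {
    // Assumption: 80 singers per page, inferred from sin=160 at cur_page=3 in the captured request.
    static final int PAGE_SIZE = 80;

    /** Builds the plain JSON payload; area and sex are the numeric codes from the enums. */
    static String buildJson(int area, int sex, int pageNum) {
        return "{\"comm\":{\"ct\":24,\"cv\":10000},"
                + "\"singerList\":{\"module\":\"Music.SingerListServer\","
                + "\"method\":\"get_singer_list\","
                + "\"param\":{\"area\":" + area
                + ",\"sex\":" + sex
                + ",\"genre\":-100,\"index\":-100"
                + ",\"sin\":" + PAGE_SIZE * (pageNum - 1)
                + ",\"cur_page\":" + pageNum + "}}}";
    }

    /** Percent-encodes the JSON and prefixes it with "data=", like the hand-written string. */
    static String buildDataParam(int area, int sex, int pageNum) {
        try {
            return "data=" + URLEncoder.encode(buildJson(area, sex, pageNum), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same filter as the captured request: all areas (-100), all sexes (-100), page 3.
        System.out.println(buildDataParam(-100, -100, 3));
    }
}
```

`URLEncoder` targets form encoding and would turn spaces into `+`, which does not matter here because the payload contains no spaces.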
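The sex values passed to getUrl are:
public enum Sex {
    MALE(0, "男"),      // male
    FEMALE(1, "女"),    // female
    BAND(2, "组合");    // group

    // Fields and constructor were left out of the original notes; the field names are a guess.
    private final int sex;
    private final String label;
    Sex(int sex, String label) { this.sex = sex; this.label = label; }
    public int getSex() { return sex; }
}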
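The area values are:
public enum Area {
    NEIDI(200, "内地"),   // mainland China
    GANGTAI(2, "港台"),   // Hong Kong and Taiwan
    OUMEI(5, "欧美"),     // Europe and America
    RIBEN(4, "日本"),     // Japan
    HANGUO(3, "韩国"),    // Korea
    QITA(6, "其他");      // other
    // ALL(-100);

    // Fields and constructor were left out of the original notes; the field names are a guess.
    private final int area;
    private final String label;
    Area(int area, String label) { this.area = area; this.label = label; }
    public int getArea() { return area; }
}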
With that, all the singer info can be fetched. I handed the job to the backend and let it run.
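The backend job boils down to walking every area/sex pair page by page. Below is a rough sketch of that loop, not the real job: `PageFetcher` is a hypothetical hook standing in for the HTTP request and parsing, and treating an empty page as the end of a filter is an assumption.

```java
class SingerCrawlPlan {
    // Numeric codes copied from the Sex and Area enums.
    static final int[] SEXES = {0, 1, 2};
    static final int[] AREAS = {200, 2, 5, 4, 3, 6};

    /** Hypothetical hook: fetches one page and returns how many singers it contained. */
    interface PageFetcher {
        int fetch(int area, int sex, int pageNum);
    }

    /**
     * Visits every area/sex pair, requesting pages 1, 2, 3, ... until a page
     * comes back empty (assumed end marker). Returns the total number of singers seen.
     */
    static int crawlAll(PageFetcher fetcher) {
        int total = 0;
        for (int area : AREAS) {
            for (int sex : SEXES) {
                for (int page = 1; ; page++) {
                    int count = fetcher.fetch(area, sex, page);
                    if (count == 0) {
                        break;
                    }
                    total += count;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Stub fetcher for a dry run: pretends every filter has two full pages of 80.
        int total = crawlAll((area, sex, page) -> page <= 2 ? 80 : 0);
        System.out.println(total);
    }
}
```

A real run would also want a delay between requests and a retry on failures, which are left out here.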
=====================Original post==============================
1. My first idea was to get the singer info by capturing the HTTP requests, but the request turned out to be
https://u.y.qq.com/cgi-bin/musicu.fcg?callback=getUCGI0300&g_tk=5381&jsonpCallback=getUCGI0300&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A160%2C%22cur_page%22%3A3%7D%7D%7D
The last field looks like it was produced by some elaborate algorithm:
data:
%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A160%2C%22cur_page%22%3A3%7D%7D%7D
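Too complicated. I'd rather scrape the singer info straight from the web page.
Time to start crawling!
But I don't know how to write a crawler in Java yet and haven't found a suitable framework, so I'm off to look for one first...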
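Looking back, the data field is less mysterious than it seemed: the `%7B`, `%22` and `%3A` sequences are just the percent-encoded forms of `{`, `"` and `:`. A small sketch that decodes it with `java.net.URLDecoder` (the class name is made up):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeDataField {
    /** Percent-decodes a query-string value as UTF-8. */
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to exist on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The data value from the captured request above.
        String data = "%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C"
                + "%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C"
                + "%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B"
                + "%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C"
                + "%22index%22%3A-100%2C%22sin%22%3A160%2C%22cur_page%22%3A3%7D%7D%7D";
        System.out.println(decode(data));
    }
}
```

The decoded value is a JSON object whose `param` part carries `area`, `sex`, `genre`, `index`, `sin` and `cur_page`, so the only things that change between requests are a few plain numbers.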