提要
近期需要通过业务日志,统计出业务使用数据的分布情况,业务日志的数据格式如下,数据量大约是三百万。想借助hive的数据处理能力,通过解析url字段中的starttime字段,并通过对starttime聚合,分析出使用的数据分布情况。由于案例样本比较大,样本不方便提供,本文章只使用下面提供的数据样本。
数据样本
简化版的案例数据
3704429781c54f7e940e53b10973e44b,2020-10-26T05:05:10.88Z,127.0.0.1,"http://localhost/es/search/native?statement={""index"":[""app1_retention""],""ignore_unavailable"":true}
{""size"":1,""sort"":[{""wx_wci"":{""order"":""desc""}}],""query"": {""bool"": {""must"": {""bool"": {""must"": [{""match_phrase"": {""news_title"": ""新未成年人保护法来了""}}]}}}},""_source"": []}&startTime=1602479109000&endTime=1603688709000",iZbp1d6qlouk35su365hwzZ,200 OK,清博指数,316,/es/search/native
afd25001aee74f5e94409750b64c383d,2020-10-26T05:05:11.078Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title:""星光"") OR (news_content:""许仁"" AND news_content:""星光"")) AND news_media:journal AND news_posttime:[2019-04-12T00:00:00 TO 2019-04-13T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1554998400000&endTime=1555084800000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,172,/es/search/scroll
3c8b14238fe94ce1ab02088ec881e605,2020-10-26T05:05:11.194Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""侯杰"" AND news_title:""豪杰"") OR (news_content:""侯杰"" AND news_content:""豪杰"")) AND news_media:wx AND news_posttime:[2020-02-26T00:00:00 TO 2020-02-27T00:00:00}&size=500&repo=weixin1_retention&fields=news_media,wx_name,app_name,news_channel,news_title,news_digest,wx_news_position,news_positive,news_posttime,news_url,content_city,content_province,media_type,news_is_origin,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1582646400000&endTime=1582732800000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,139,/es/search/scroll
ad433ae4710c4fb9b4f09f5987dbace1,2020-10-26T05:05:11.208Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title:""星光"") OR (news_content:""许仁"" AND news_content:""星光"")) AND news_media:journal AND news_posttime:[2019-04-13T00:00:00 TO 2019-04-14T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1555084800000&endTime=1555171200000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,129,/es/search/scroll
03e440b5905f4ac9b0074c0923191aea,2020-10-26T05:05:11.264Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""汪奇魔"" AND news_title:""魔术"") OR (news_content:""汪奇魔"" AND news_content:""魔术"")) AND news_media:web AND news_posttime:[2019-12-31T00:00:00 TO 2020-01-01T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1577721600000&endTime=1577808000000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,135,/es/search/scroll
fd44e8f6e2c141eb8c5a36da3d9e64c5,2020-10-26T05:05:11.48Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title:""星光"") OR (news_content:""许仁"" AND news_content:""星光"")) AND news_media:journal AND news_posttime:[2019-04-15T00:00:00 TO 2019-04-16T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1555257600000&endTime=1555344000000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,153,/es/search/scroll
445c5b48694745c196be3f59102b677b,2020-10-26T05:05:11.574Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""侯杰"" AND news_title:""豪杰"") OR (news_content:""侯杰"" AND news_content:""豪杰"")) AND news_media:wx AND news_posttime:[2020-02-27T00:00:00 TO 2020-02-28T00:00:00}&size=500&repo=weixin1_retention&fields=news_media,wx_name,app_name,news_channel,news_title,news_digest,wx_news_position,news_positive,news_posttime,news_url,content_city,content_province,media_type,news_is_origin,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1582732800000&endTime=1582819200000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,373,/es/search/scroll
f8f66d478bde4690a3964fbc1de566dc,2020-10-26T05:05:11.686Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""汪小魔"" AND news_title:""魔术"") OR (news_content:""汪小魔"" AND news_content:""魔术"")) AND news_media:web AND news_posttime:[2019-01-04T00:00:00 TO 2019-01-05T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&scrollId=DnF1ZXJ5VGhlbkZldGNoAgAAAAABeMY2Fm9zbzUtNWc0VGxhcUd2TkpESHcwTFEAAAAAAR4AYRZQUkEwUFgwOFFDdVdSUnM1QUZQZGJR&startTime=1546531200000&endTime=1546617600000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,72,/es/search/scroll
68611f445fb54d269148825cdb43afdb,2020-10-26T05:05:11.889Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title:""星光"") OR (news_content:""许仁"" AND news_content:""星光"")) AND news_media:journal AND news_posttime:[2019-04-18T00:00:00 TO 2019-04-19T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1555516800000&endTime=1555603200000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,125,/es/search/scroll
4a71ee5357174a43919e07c4cafb9eb7,2020-10-26T05:05:11.991Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""汪小魔"" AND news_title:""魔术"") OR (news_content:""汪小魔"" AND news_content:""魔术"")) AND news_media:web AND news_posttime:[2019-01-09T00:00:00 TO 2019-01-10T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1546963200000&endTime=1547049600000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,58,/es/search/scroll
c30779e6a52f4f6f966f7f04af47b66f,2020-10-26T05:05:12.052Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title:""星光"") OR (news_content:""许仁"" AND news_content:""星光"")) AND news_media:forum AND news_posttime:[2019-04-19T00:00:00 TO 2019-04-20T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1555603200000&endTime=1555689600000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,155,/es/search/scroll
225d160fe286493a9eb1dd7596507a70,2020-10-26T05:05:12.073Z,127.0.0.1,"http://localhost/es/search/msearch?statement={ ""size"": 1, ""query"": {""bool"":{""filter"":[{""range"":{""news_posttime"":{""gt"":""2020-10-25 00:00:00"",""lt"":""2020-10-25 23:59:59""}}},{""terms"":{""wx_biz"":[""MjM5NjE4MDk4MQ=="",""MzA5MjEwNjMwMw=="",""MzA3MjAzMDMwMg=="",""MjM5MDEyOTAwMA=="",""MzA5NTk2MjM4NQ=="",""MzI0NTAzMjU1OA=="",""MzI0OTMzMDY4Mg=="",""MzIxMDM1NjIzNQ=="",""MzUyOTE1MzEwMw=="",""MzIxOTU2MDEzOQ=="",""MzI3MjAwMjI2OA=="",""MjM5ODc1MjU2MA=="",""MjM5ODc2MzYwMQ=="",""MjM5NDI3MjkyNQ=="",""MjM5NzQ5MTkyMA=="",""MzI5MDg4Njg2Nw=="",""MzA5MDc5NjgwNg=="",""MzAwMjI0NDgwMA=="",""MjM5MzM1MjA4MA=="",""MjM5NDU3Mjc4OA=="",""NDM5MzI1NQ=="",""MzA5MjM3MDE2NA=="",""MzAxMzI0MDM3Ng=="",""ODU5OTQ4ODgx"",""MzA5NjA3NzU1OQ=="",""MzIwMjE1NzU4Mg=="",""MzIyNzE3MjgyOQ=="",""MzIzOTc1NzA3OA=="",""MzI1MzM3MDg4NA=="",""MzU4MTA0NTg5MQ=="",""MzIyMjgyNTg4Mg=="",""MjM5MDQwMzc1NQ=="",""MjM5MjQ1NjExMg=="",""MzA5MDI3MjEzNQ=="",""MjM5ODE5MTY1OA=="",""MjM5MjEzNTY1Ng=="",""MjM5NDg4OTA3MQ=="",""MjM5NTA3Mjc0MA=="",""MzIyMTI5NjM2MQ=="",""MjM5NjQ4MzAyMg=="",""MzA4NjA4ODMzNQ=="",""MjM5MDI2NDkwMA=="",""MzI3MjA0OTM0NQ=="",""MzA5MjIwNzcyNQ=="",""MzI3NzAyMDgzOA=="",""MzA4OTUwMTU1NQ=="",""MzA4OTkxMjM0Ng=="",""MzIxMjU2OTk0MA=="",""MzI5MzQ2Mjc2Ng=="",""MzI4MTgxMTc2Ng=="",""MzI0Mzk2MjI2Ng=="",""MjM5MDk5NDAwMA=="",""MjM5OTE3OTgwMQ=="",""MjM5OTMzNDk2MA=="",""MjM5OTQ3ODA4MQ=="",""MjM5NTA4MDU4OA=="",""MzA3OTQ1MDEwMg=="",""MjM5ODI2Mjc4Mw=="",""MjM5ODI4NTE0MQ=="",""MzAxOTM1OTQ0NA=="",""MzA4ODYzMTkxNw=="",""MzA3Mzc0NzkzMw=="",""MzAxNDgzNjcwNw=="",""MzA5NzQzMTQ2Ng=="",""MzAwNDQwMjEzMQ=="",""MzIxMTE1MTI5NA=="",""MzI2MDI4MTg5Ng=="",""MzAxNzQxMTM2Ng=="",""MzI2ODg1ODMyMw=="",""MzIwNTk1MTI4Nw=="",""MjM5OTQ3MjM4MA=="",""MjM5Njc5MjQwMQ=="",""MzA3NzM0NTcxMA=="",""MzA3NTE3NTAwMA=="",""MjM5NDgxNDkyMw=="",""MzA4MDAyNzkzMw=="",""MzUyOTE3Mzk1Mg=="",""MzI1ODQxMzE1OA=="",""MjM5NjEwMTM5MQ=="",""MjM5ODM3NzI0NA=="",""MjM5OTY3NjcyMA=="",""MzA5MjAyNzUyMA=="",""MjM5MjY2MTYwMw=="",""MzI1MzM1NDQ5MQ=="",""MzA5OTI5OTEyNw=="",""MzA5MjU2MjQwNg=="",""MzA5MTc1OTAxNA=="",""MzA3MzI0OTE4OA=="",""MzIxNjQxNjcxMA=="",""MzAwMTY0ODM5MA=="",""MzI1MTM2MjQyNw=="",""MzI0NTQ0NDQwMA=="",""MzI2NjQyMzcwMw=="",""MjM5Nzg1NDMyMw=="",""MzA5OTE4NTQyMg=="",""MzA5ODA1ODgwOA=="",""MjM5MzI2NjMyMA=="",""MjM5MjY1MzQ1Mw=="",""MjM5NzQ3OTU0MA=="",""MjM5Njc5NjMyMA=="",""MjM5MTE3MzA3Mg=="",""MzA5OTAwMDMzNg=="",""MjM5OTA0NzI2MA=="",""MzI5NTEwOTA2NA=="",""MzA4NTMzMTU5Mg=="",""MzA4NjM0OTg0Ng=="",""MzU2MDE0NDE2Nw=="",""Mjg0NDcyODU4MA=="",""MTU5MzQ2Nzk0MQ=="",""MjM5OTU1MjI4Mw=="",""MzA3ODE3MzQyOQ=="",""MzA5MjU1ODIyMg=="",""MzAwNzE1ODA2OA=="",""MjM5ODAzODEyNQ=="",""MzA3MzgzOTMxNQ=="",""MzA4Mjc1NjkyNg=="",""MzI3NDU5MDE2Mw=="",""MjM5ODc2NTA1MQ=="",""MzA3MjQxODIyNQ=="",""MzA3NDczMzUwNw=="",""MzAxNTI5NzYzMA=="",""MzA5ODA1NzIwMg=="",""MjM5ODY4MzcwNA=="",""MjM5MDMwMjIyNQ=="",""MzA5NzAwNzMwOA=="",""MzAwNzAxODA4Mg=="",""MzA3MDQ1ODIyNQ=="",""MjM5NTU1ODEzNA=="",""MzA3NTQyMDA4Nw=="",""MzA4MjEzOTM0Mg=="",""MzI5MjA2NDk3NQ=="",""MzAwMDgyOTUzNQ=="",""MzI1MzE5ODIwNw=="",""MzI4NzIwOTEzNQ=="",""MzIxNTY0NjU0OA=="",""MzI3MTUxMzA4Mw=="",""MzI3MjYzNDYxNw=="",""MzU4MjExNTAzNg=="",""MzI5MDc5NDE4Mw=="",""MzIwNDk1MjUyNQ=="",""MjM5MTUxNTk0MA=="",""MjM5MjkxNjU5NA=="",""MjM5MTU2NzA1NA=="",""MjM5MzQ2ODM3Mg=="",""MzA3MTAzNDM2OA=="",""MzI2MTU4MzkwOA=="",""MzIyMDAxMDk0Mg=="",""MzA4MDM1ODgwMw=="",""MjM5NjQyNjUxNA=="",""MzA3MzU2NzgwMA=="",""MzA4MTQwMDcxMQ=="",""MzA4MDg3ODAzMw=="",""MzA3NDg1Mjg2NQ=="",""MzA4Mjc4MjE2MA=="",""MzAxNDQ4Njg5OA=="",""MzA4MjczMDM2Mg=="",""MzA3NjAxNjMyMA=="",""MzAxNTQ3Mjc2NA=="",""MjM5MDE2OTYxMA=="",""MzA3MDQ1MzUzOQ=="",""MjM5MTc3MTg4MA=="",""MjM5NzY3MDQ2MA=="",""MjM5OTgzOTY4Mw=="",""MzI3MjE4Mzc5Ng=="",""MjM5MDg2MTA4MA=="",""MjM5MzI4NjIwNA=="",""MzA4ODkyNTI0Mg=="",""MzA4MzkyNTIzOA=="",""MzU0MjAxMjM0NQ=="",""MjM5NzA3NjQwMA=="",""MzA4ODE0MjUxOQ=="",""MzI3MDQzNTI2OQ=="",""MjM5NzI1MTY0MQ=="",""MjM5MjM1ODA0MA=="",""MzA4MTEwODcyOQ=="",""MzA5NTI0MTgyNQ=="",""MjM5Njk4MTM2MQ=="",""MjM5NjM4ODQwMQ=="",""MjM5NzgyMjkwMA=="",""MjM5NDI5MTgyMA=="",""MzU0MTQ4NzQxNw=="",""MzA3NTA4NTIxOA=="",""MjM5MjQ1ODczMg=="",""MzIyMDExMjk0Nw=="",""MjM5ODU5ODU3NA=="",""MTI0MDU3NDYwMQ=="",""MzU1ODM3MTI1Ng=="",""MzIzMDc2MTcwOQ=="",""MTIyMzU2MzcwMQ=="",""MzU4MTQwMDIxOA=="",""MzI2ODk0NDgwOA=="",""MzU0NTU1OTMyOQ=="",""MzUyMTQ0OTUwNg=="",""MzU5MDM3NzkxNg=="",""MzI3NTc4NzU5MA=="",""MzUzNTY5MTM2MA=="",""MzUxMTYxNjM2OQ=="",""MzUxNTYxODg2NQ=="",""MzA5ODEyNTgwMA=="",""MzI5MDIyNDI5Mw=="",""MzA3ODMwMTg5Ng=="",""MzUyMjUxMzkwMg=="",""MzI3OTk2NzE2Mw=="",""MjM5MTc4MzQyMA=="",""MzA5NTMwMTE0NQ=="",""MjM5NzM3OTE4NA=="",""MjM5MjcxODYzNQ=="",""MjM5ODE2NTcyMA=="",""MjM5NzcwODI4MQ=="",""MzA3ODIwNTUwNA=="",""MzA4MjA1NjAzMQ=="",""MjM5NjI0OTM2NQ=="",""MjM5MTc3NDkwMA=="",""MjM5MTE4MzQzNg=="",""MzA5NzM3NjIwMg=="",""MzIwMTgwMzk1Ng=="",""MzUyNTAxOTQxMw=="",""MzIxODE3MDIyNA=="",""MzI3NDE3NjYxOQ=="",""MzIwODUyMDE4Nw=="",""MzIyNTE2ODg2OA=="",""MzUxMjAzNDUyOA=="",""MzIyOTE0Mzc1OQ=="",""MzIwNzI3ODk1Ng=="",""MzIxOTk0OTU3MQ=="",""MzIyMzg5NzA3Ng=="",""MzU5ODA0MDUzOA=="",""MjM5NDA3NTE0MA=="",""MzIyMjAwNDUzMg=="",""MjM5NjU4NTU3Mw=="",""MzAxNTE0MDAyOA=="",""MzI3NjEwMTIyMg=="",""MzUxOTEzMzYzOQ=="",""Mjc1MzA3NzI4MA=="",""MTQ0NDM5MTA2MQ=="",""MzI1MzAwNDc3Ng=="",""MzAwMTU2MDM3Nw=="",""MzA4NzEyODIzMA=="",""MzA4NTE1NDcxMw=="",""MzA3ODIwOTAzNw=="",""MzA3ODA5NTEwNQ=="",""MjM5OTYwMjUwMg=="",""MjIwNDUyMTEwMA=="",""MjM5NjI0NTkyMQ=="",""MzA5NjI2OTM1MQ=="",""MzA5OTU0NjQ4MQ=="",""MzA4NjE3MjMxMQ=="",""MzA3NTExNDcyMg=="",""MzAwNDU1NjYxOQ=="",""MjM5NzM4ODcyMg=="",""MzIyNzg5NTkxMw=="",""MzAwNTM5OTQzMg=="",""MzIyMTA1MDY2Mg=="",""MzA4OTQ2MTEwNg=="",""MjM5OTM3MjQ5NQ=="",""MjM5MTExMTMwOQ=="",""MjM5NTExMTYwMQ=="",""Mjc4NjgzMzYwMA=="",""MzI2OTE0NTgzNA=="",""MzA4NjM2NTcxMw=="",""MzA4NzM3NTMxMQ=="",""MjM5NjU2ODc1Mw=="",""MzA3NTMzMDYxNA=="",""MjM5OTc4Mjk1NQ=="",""MzAxMzc4NTYxMw=="",""MjM5MDE3Njc3Ng=="",""MzA4MTQzMTUwOQ=="",""MzA3NzMwMDQwMg=="",""MzA3OTU4ODUyMw=="",""MzA5MzM1NDEzNg=="",""MzA4MDA2OTgwOQ=="",""MTYyODMwOTMwMQ=="",""MzIxMTAwOTc5Ng=="",""MzA5MTM3OTkxMQ=="",""MzA4NjEwNjMzOQ=="",""MzI0NTM5NjM4Ng=="",""MzA3NTE1NzQxMw=="",""MzA3MDI0NzcxOQ=="",""MzI3NTE4OTQ3Mw=="",""MzUzMjUwNzMzOA=="",""MzU0MzM0MzI4Mw=="",""MzAxMDg0NDI4OA=="",""MjM5MDQ3NDA3Ng=="",""MzA3MzM2NzcyNA=="",""MzIzNjcxNDA1Mg=="",""MzA5MzM2ODgzOA=="",""MjM5ODMyNTE4MA=="",""MjM5MTcxMjA5Mw=="",""MzAxODg0MTg4MQ=="",""MzA3NjM5NjkwNw=="",""MjM5MTUzMzM3Nw=="",""MzA5NDg2MzYzNA=="",""MjM5MjcwODg0MA=="",""MjM5MTUxNzQ3MQ=="",""MjM5MzUxODExNQ=="",""MzA4OTMwNDcwMg=="",""MzA5NDYwMDEzMg=="",""MzA3MTA4Njg3OQ=="",""MzA4NjEyODIxOA=="",""MzA5MzEzNTcxNg=="",""MzI1NDY0OTAyMw=="",""MzA3NDEzNDgxNA=="",""MzIxOTY1NzkyMA=="",""MzU0MzU4MDIxNw=="",""MzU0NjAwNzkzOA=="",""MjM5MzUwODA3NQ=="",""MzIyMjc1MjkxMw=="",""MzA3MjMxMjYzMA=="",""MzI5ODg5Mjc2Nw=="",""MzI2NTUzMzY3Mw=="",""MzI3NTE0MDg0Mg=="",""MzUzNzkwMDM0Mw=="",""MzU0ODcxMjg2Mg=="",""MzI2MzA3NzQ3MQ=="",""MjM5ODQ1Mzg4MQ=="",""MzI1ODU3Nzg2Nw=="",""MzA5ODA4ODQyNg=="",""MzU3Njg3MTIxMA=="",""MzA3NjQ1NTAyOQ=="",""MzA3OTI4ODk5Mw=="",""MzA5NzQzMDEwNA=="",""MzU5NzM2MDI5MQ=="",""MjM5ODc4OTk2MQ=="",""MzI0MDkwNTQ5MA=="",""MjM5MTU5OTI0MQ=="",""MzA5NjEwNTAxOA=="",""MzA4NjU5NTEwMQ=="",""MzIyMDQ3Mzk2OQ=="",""MzA4MTU1OTIzMA=="",""MzA4MTIwNTIwNQ=="",""MjM5NDE1MTMwMg=="",""MzI0NTE4ODgxNA=="",""MzU1NzUwNTg2Ng=="",""MjM5MTM5Nzc3MQ=="",""MzAwODg4MjYzOQ=="",""MjM5MDYxNTUzMA=="",""MzA5NDA2MjE1NQ=="",""MzI3MjA5NDU0MQ=="",""MzIzNzEwOTYwMg=="",""MzA5MTg0MTM0OA=="",""MzA5MjA3NTY0NA=="",""MzA5MjMyMTYyNQ=="",""MzIyOTAxNzUwOQ=="",""MzIxOTIzOTcxMA=="",""MzIxNzU1NTQwNQ=="",""Mzg2NTI2NDc0OQ=="",""Mzg4OTIzNTMzMA=="",""MzI3NDM3NTk2OQ==""]}}],""must"":[{""bool"":{""should"":[{""bool"":{""must"":[{""bool"":{""should"":[{""match_phrase"":{""news_title"":""\u591c\u91cc\u7684\u58f0\u97f3\u4f1a\u53d1\u5149""}},{""match_phrase"":{""news_content"":""\u591c\u91cc\u7684\u58f0\u97f3\u4f1a\u53d1\u5149""}}]}},{""bool"":{""should"":[{""match_phrase"":{""news_title"":""\u5b89\u5cb3\u4eba\u6c11""}},{""match_phrase"":{""news_content"":""\u5b89\u5cb3\u4eba\u6c11""}}]}},{""bool"":{""should"":[{""match_phrase"":{""news_title"":""\u7535\u53f0""}},{""match_phrase"":{""news_content"":""\u7535\u53f0""}}]}}]}},{""bool"":{""must"":[{""bool"":{""should"":[{""match_phrase"":{""news_title"":""\u591c\u91cc\u7684\u58f0\u97f3\u4f1a\u53d1\u5149""}},{""match_phrase"":{""news_content"":""\u591c\u91cc\u7684\u58f0\u97f3\u4f1a\u53d1\u5149""}}]}},{""bool"":{""should"":[{""match_phrase"":{""news_title"":""FM98.0""}},{""match_phrase"":{""news_content"":""FM98.0""}}]}}]}}]}}]}} }&repos=weixin1_retention&startTime=1603555200000&endTime=1603641599000",iZbp1d6qlouk35su365hwzZ,200 OK,太极项目,80,/es/search/msearch
2491b7ccfa0d409da2edadb25c84bb6a,2020-10-26T05:05:12.255Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""汪小魔"" AND news_title:""魔术"") OR (news_content:""汪小魔"" AND news_content:""魔术"")) AND news_media:web AND news_posttime:[2019-01-13T00:00:00 TO 2019-01-14T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1547308800000&endTime=1547395200000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,66,/es/search/scroll
be0c58a2967a4e3991796d1839b29ce7,2020-10-26T05:05:12.279Z,127.0.0.1,"http://localhost/es/search/native?statement={""index"":[""app1_retention""],""ignore_unavailable"":true}
{""size"":1,""sort"":[{""wx_wci"":{""order"":""desc""}}],""query"": {""bool"": {""must"": {""bool"": {""must"": [{""match_phrase"": {""news_title"": ""科学家发现人体新器官""}}]}}}},""_source"": []}&startTime=1602479111000&endTime=1603688711000",iZbp1d6qlouk35su365hwzZ,200 OK,清博指数,225,/es/search/native
946d1622e001487b991278d933df7da1,2020-10-26T05:05:12.741Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title:""星光"") OR (news_content:""许仁"" AND news_content:""星光"")) AND news_media:forum AND news_posttime:[2019-04-22T00:00:00 TO 2019-04-23T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1555862400000&endTime=1555948800000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,306,/es/search/scroll
样本上传
# 切换到自定义存储文件目录
> cd /opt/document
> vi analyse.csv
数据【复制上面的简化版的数据】
样本预处理【简化版的数据】
由于csv文件是以纯文本形式存储表格数据(数字和文本),数据本身很容易发生错乱情况。目前提供简化版的数据,出现若干行中的列值存在换行符和csv默认使用的分隔符。这就会导致将简化版的数据导入hive表中会出现错行和列错乱的情况。我希望在数据导入hive之前先提前处理这块数据。
去除【简化版的数据】某些列中的换行符
我们参考下面的文章,使用python脚本【为啥?主要python无需编译且拥有强大的三方库支持】进行数据替换 替换csv文件中某个字段含有换行符的问题。
# 编辑python脚本
> vi filter.py
import csv
import sys
with open(sys.argv[1], 'r') as srcFile, open(sys.argv[2], 'w') as dstFile:
fileReader = csv.reader(srcFile)
fileWriter = csv.writer(dstFile)
for data in list(fileReader):
for i,d in enumerate(data):
if d.find('\r\n') != -1:
d = d.replace('\r\n', ' ')
if d.find('\n') != -1:
d = d.replace('\n', ' ')
if d.find('\r') != -1:
d = d.replace('\r', ' ')
if d.find('\\') != -1:
d = d.replace('\\', '')
data[i] = d
fileWriter.writerow(data)
dstFile.close()
srcFile.close()
> chmod 777 filter.py
# 执行
> python filter.py analyse.csv output.csv
# 查看
> less output.csv
# 截取简化版的案例数据中存在错行问题的行【细心可以比对一下处理前后的差别】
2491b7ccfa0d409da2edadb25c84bb6a,2020-10-26T05:05:12.255Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""汪小魔"" AND news_title:""魔术"") OR (new
s_content:""汪小魔"" AND news_content:""魔术"")) AND news_media:web AND news_posttime:[2019-01-13T00:00:00 TO 2019-01-14T00:00:00}&size=500&repo=web2_retention&fields=news
_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_
news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1547308800000&endTime=1
547395200000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,66,/es/search/scroll
be0c58a2967a4e3991796d1839b29ce7,2020-10-26T05:05:12.279Z,127.0.0.1,"http://localhost/es/search/native?statement={""index"":[""app1_retention""],""ignore_unavailable"":tru
e} {""size"":1,""sort"":[{""wx_wci"":{""order"":""desc""}}],""query"": {""bool"": {""must"": {""bool"": {""must"": [{""match_phrase"": {""news_title"": ""科学家发现人体新
器官""}}]}}}},""_source"": []}&startTime=1602479111000&endTime=1603688711000",iZbp1d6qlouk35su365hwzZ,200 OK,清博指数,225,/es/search/native
946d1622e001487b991278d933df7da1,2020-10-26T05:05:12.741Z,127.0.0.1,"http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title:""星光"") OR (news_
content:""许仁"" AND news_content:""星光"")) AND news_media:forum AND news_posttime:[2019-04-22T00:00:00 TO 2019-04-23T00:00:00}&size=500&repo=web2_retention&fields=news_m
edia,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_ne
ws_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1555862400000&endTime=155
5948800000",iZbp1d6qlouk35su365hwzZ,200 OK,数据迁移服务,306,/es/search/scroll
......
将【简化版的数据】的csv格式数据转换成tsv文件格式,并去除表头
由于csv默认格式是使用逗号进行分割,若分割的列中含有逗号,容易出现错列情况。 我们参考下面的文章,使用python脚本【为啥?主要python无需编译且拥有强大的三方库支持】进行文件的格式转换。 如何将CSV格式文件转成TSV格式文件。
# 编辑脚本
> conversion.py
import sys
import csv
import pandas as pd
pd_all = pd.read_csv(sys.argv[1])
pd_all.to_csv(sys.argv[2], sep='\t', header=None, index=False, mode="w")
> chmod 777 conversion.py
# 执行脚本
> python conversion.py output.csv analysed.csv
# 查看
> less analysed.csv
# 截取简化版的案例数据中存在错列问题的行【细心可以比对一下处理前后的差别】
2491b7ccfa0d409da2edadb25c84bb6a 2020-10-26T05:05:12.255Z 127.0.0.1 "http://localhost/es/search/scroll?queryString=((news_title:""汪小魔"" AND news_tit
le:""魔术"") OR (news_content:""汪小魔"" AND news_content:""魔术"")) AND news_media:web AND news_posttime:[2019-01-13T00:00:00 TO 2019-01-14T00:00:00}&size=500&repo=web2_r
etention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin
_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=154
7308800000&endTime=1547395200000" iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 66 /es/search/scroll
be0c58a2967a4e3991796d1839b29ce7 2020-10-26T05:05:12.279Z 127.0.0.1 "http://localhost/es/search/native?statement={""index"":[""app1_retention""],""igno
re_unavailable"":true} {""size"":1,""sort"":[{""wx_wci"":{""order"":""desc""}}],""query"": {""bool"": {""must"": {""bool"": {""must"": [{""match_phrase"": {""news_title"":
""科学家发现人体新器官""}}]}}}},""_source"": []}&startTime=1602479111000&endTime=1603688711000" iZbp1d6qlouk35su365hwzZ 200 OK 清博指数 225 /es/search/native
946d1622e001487b991278d933df7da1 2020-10-26T05:05:12.741Z 127.0.0.1 "http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title
:""星光"") OR (news_content:""许仁"" AND news_content:""星光"")) AND news_media:forum AND news_posttime:[2019-04-22T00:00:00 TO 2019-04-23T00:00:00}&size=500&repo=web2_ret
ention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_a
uthor_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=15558
62400000&endTime=1555948800000" iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 306 /es/search/scroll
......
分析样本
构建临时表
创建临时表
# 构建hive可以参考我之前的文章
> hive
hive> show databases;
hive> create database daiyongjun;
hive> use daiyongjun;
hive> CREATE TEMPORARY TABLE TEMPORARY_TABLE(id string,time string,ip string,url string,hostname string,status string,application string,duration string,path string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> show tables;
OK
temporary_table
Time taken: 0.275 seconds, Fetched: 1 row(s)
写入临时表
hive> LOAD DATA LOCAL INPATH '/opt/document/analysed.csv' INTO TABLE TEMPORARY_TABLE;
Loading data to table daiyongjun.temporary_table
OK
Time taken: 1.339 seconds
hive> SELECT * FROM TEMPORARY_TABLE LIMIT 10;
OK
afd25001aee74f5e94409750b64c383d 2020-10-26T05:05:11.078Z 127.0.0.1 "http://localhost/es/search/scroll?queryString=((news_title:""许仁"" AND news_title:""星光"") OR (news_content:""许仁"" AND news_content:""星光"")) AND news_media:journal AND news_posttime:[2019-04-12T00:00:00 TO 2019-04-13T00:00:00}&size=500&repo=web2_retention&fields=news_media,app_name,news_channel,news_title,news_digest,news_positive,news_posttime,news_url,content_city,content_province,news_is_origin,media_type,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1554998400000&endTime=1555084800000" iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 172 /es/search/scroll
3c8b14238fe94ce1ab02088ec881e605 2020-10-26T05:05:11.194Z 127.0.0.1 "http://localhost/es/search/scroll?queryString=((news_title:""侯杰"" AND news_title:""豪杰"") OR (news_content:""侯杰"" AND news_content:""豪杰"")) AND news_media:wx AND news_posttime:[2020-02-26T00:00:00 TO 2020-02-27T00:00:00}&size=500&repo=weixin1_retention&fields=news_media,wx_name,app_name,news_channel,news_title,news_digest,wx_news_position,news_positive,news_posttime,news_url,content_city,content_province,media_type,news_is_origin,origin_author_name,origin_news_url,news_uuid,sim_hash,news_keywords_list,news_province,news_city,author_gender,news_negative,app_code,picurl,is_video,news_postdate&startTime=1582646400000&endTime=1582732800000" iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 139 /es/search/scroll
......
构建最终表
创建最终表
hive> CREATE TABLE ANALYSED(id string,time string,ip string,startTime timestamp,endTime timestamp,hostname string,status string,application string,duration string,path string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
写入最终表
分析业务使用数据分布情况,业务是通过starttime控制检索数据范围也就是日志URL中的startime字段。因此我们只需要获取starttime和endttime(可以不需要)字段即可。 我们将使用hive内置的函数获取URL中的starttime字段。并将字段转换成可以聚合的"YYYY-MM-dd HH:mm:ss"格式。
regexp_extract(string subject, string pattern, int index)
通过正则表达式对字符串进行截取
hive> SELECT regexp_extract(URL,'(startTime=)([0-9]{13})',2) FROM TEMPORARY_TABLE LIMIT 2;
# hive内置的函数实际都是 map-reduce方法 【扩展】
Query ID = daiyongjun_20201105023504_d50dbbf1-835e-4b3f-b1bc-7bcc0e884804
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
......
OK
1554998400000
1582646400000
Time taken: 80.09 seconds, Fetched: 2 row(s)
cast(expr as )
将截取后的字符串,强制转换成BIGINT类型
hive> SELECT cast(regexp_extract(URL,'(startTime=)([0-9]{13})',2) as BIGINT) FROM TEMPORARY_TABLE LIMIT 2;
Query ID = daiyongjun_20201105024739_233f875e-5e41-4170-b5a8-5961c5cf3bcd
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
......
Total MapReduce CPU Time Spent: 11 seconds 40 msec
OK
1554998400000
1582646400000
from_unixtime(bigint unixtime[, string format])
将截取后的字符串,强制转换成BIGINT类型,在转换format of "1970-01-01 00:00:00"类型。【扩展1554998400000转换失效需要/1000如(1554998400000/1000),可是(1554998400000/1000)是Double类型,无法隐式转换需要再一次cast强制转换。】
hive> SELECT cast(cast(regexp_extract(URL,'(startTime=)([0-9]{13})',2) as BIGINT)/1000 as BIGINT) FROM TEMPORARY_TABLE LIMIT 2;
Query ID = daiyongjun_20201105030538_f1198cc8-9f4f-463b-a8c4-dbbfa634948d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
......
Total MapReduce CPU Time Spent: 10 seconds 960 msec
OK
1554998400
1582646400
hive> SELECT from_unixtime(cast(cast(regexp_extract(URL,'(startTime=)([0-9]{13})',2) as BIGINT)/1000 as BIGINT),'yyyy-MM-dd hh:mm:ss') FROM TEMPORARY_TABLE LIMIT 2;
Query ID = daiyongjun_20201105030816_20f40a42-2f3f-4eaa-9624-b505aa983e9f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
......
Total MapReduce CPU Time Spent: 2 seconds 520 msec
OK
2019-04-11 12:00:00
2020-02-25 11:00:00
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement
将临时表的数据写入最终表
hive> INSERT OVERWRITE TABLE ANALYSED SELECT id,time,ip,from_unixtime(cast(cast(regexp_extract(url,'(startTime=)([0-9]{13})',2) as bigint)/1000 as bigint),'yyyy-MM-dd hh:mm:ss') as startTime,from_unixtime(cast(cast(regexp_extract(url,'(endTime=)([0-9]{13})',2) as bigint)/1000 as bigint),'yyyy-MM-dd hh:mm:ss') as endTime,hostname,status,application,duration,path FROM TEMPORARY_TABLE LIMIT 10;
Query ID = daiyongjun_20201105031309_bf73aeca-51ab-405e-af51-fc1c743c5cc6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
......
Total MapReduce CPU Time Spent: 30 seconds 470 msec
OK
Time taken: 62.033 seconds
hive> SELECT * FROM ANALYSED;
OK
c30779e6a52f4f6f966f7f04af47b66f 2020-10-26T05:05:12.052Z 127.0.0.1 2019-04-18 12:00:00 2019-04-19 12:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 155 /es/search/scroll
4a71ee5357174a43919e07c4cafb9eb7 2020-10-26T05:05:11.991Z 127.0.0.1 2019-01-08 11:00:00 2019-01-09 11:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 58 /es/search/scroll
68611f445fb54d269148825cdb43afdb 2020-10-26T05:05:11.889Z 127.0.0.1 2019-04-17 12:00:00 2019-04-18 12:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 125 /es/search/scroll
f8f66d478bde4690a3964fbc1de566dc 2020-10-26T05:05:11.686Z 127.0.0.1 2019-01-03 11:00:00 2019-01-04 11:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 72 /es/search/scroll
445c5b48694745c196be3f59102b677b 2020-10-26T05:05:11.574Z 127.0.0.1 2020-02-26 11:00:00 2020-02-27 11:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 373 /es/search/scroll
fd44e8f6e2c141eb8c5a36da3d9e64c5 2020-10-26T05:05:11.48Z 127.0.0.1 2019-04-14 12:00:00 2019-04-15 12:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 153 /es/search/scroll
03e440b5905f4ac9b0074c0923191aea 2020-10-26T05:05:11.264Z 127.0.0.1 2019-12-30 11:00:00 2019-12-31 11:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 135 /es/search/scroll
ad433ae4710c4fb9b4f09f5987dbace1 2020-10-26T05:05:11.208Z 127.0.0.1 2019-04-12 12:00:00 2019-04-13 12:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 129 /es/search/scroll
3c8b14238fe94ce1ab02088ec881e605 2020-10-26T05:05:11.194Z 127.0.0.1 2020-02-25 11:00:00 2020-02-26 11:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 139 /es/search/scroll
afd25001aee74f5e94409750b64c383d 2020-10-26T05:05:11.078Z 127.0.0.1 2019-04-11 12:00:00 2019-04-12 12:00:00 iZbp1d6qlouk35su365hwzZ 200 OK 数据迁移服务 172 /es/search/scroll
Time taken: 0.347 seconds, Fetched: 10 row(s)
分析最终表
分析
to_date(timestamp timestamp)
将timestamp格式转换成字符串
hive> SELECT to_date(startTime) FROM ANALYSED LIMIT 2;
Query ID = daiyongjun_20201105032948_3d070746-5f54-4f13-9c8c-4d82a12a785a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
......
Total MapReduce CPU Time Spent: 10 seconds 150 msec
OK
2019-04-18
2019-01-08
Time taken: 52.994 seconds, Fetched: 2 row(s)
year(string date);month(string date)
将字符串类型的转化成年份和月份
hive> SELECT year(to_date(startTime)),month(to_date(startTime)) FROM ANALYSED LIMIT 2;
Query ID = daiyongjun_20201105033603_078a3d3a-3431-41b6-a4c7-574c1f0b7b3e
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
.....
Total MapReduce CPU Time Spent: 10 seconds 590 msec
OK
2019 4
2019 1
GROUP BY year(to_date(startTime)),month(to_date(startTime))
针对startime进行复合聚合得出结果
hive> SELECT year(to_date(startTime)),month(to_date(startTime)),count(*) FROM ANALYSED GROUP BY year(to_date(startTime)),month(to_date(startTime));
Query ID = daiyongjun_20201105034005_f0257113-ffa5-485b-b69b-18ad129d6a24
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
......
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 21.01 sec HDFS Read: 12785 HDFS Write: 172 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 21 seconds 10 msec
OK
2019 1 2
2019 4 5
2019 12 1
2020 2 2