Big Data Warehouse Platform: User Behavior Data Collection


Data Warehouse Pipeline Diagram

(Figure: data warehouse pipeline diagram)

User Behavior Logs

User behavior logs capture each user action together with the environment in which it occurred. This information is collected mainly to optimize the product and to provide data support for analytical metrics. The usual collection technique is event tracking (埋点).

User Behavior Log Content

The user behavior information collected and analyzed in this project consists mainly of page-view records, action records, exposure records, app-launch records, and error records.

(Figure: user behavior log content overview)
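To make this concrete, below is a hypothetical page-view record in the general shape such mock logs take; every field name and value here is illustrative rather than the project's exact schema.

{
  "common": {"mid": "mid_391", "uid": "35", "os": "Android 11", "vc": "v2.1.134"},
  "page": {"page_id": "good_detail", "last_page_id": "good_list", "item": "35", "item_type": "sku_id", "during_time": 12000},
  "ts": 1592113131000
}

The ts field carries the event time, which is what the mock.date setting below controls.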

Mock Data

1) Upload application.yml, gmall2020-mock-log-2021-10-10.jar, path.json, and logback.xml to the /opt/module/applog directory on hadoop102

(1) Create the applog directory

mkdir /opt/module/applog

(2) Upload the files to the /opt/module/applog directory

2) Configuration files

(1) application.yml

This file controls generation, so user behavior logs can be produced for any required date.

vim application.yml

Modify the following content:

# enable the external logging config
logging.config: "./logback.xml"
# business date. Note: this is not the date on which the Linux system generates
# the log files, but the event time written into the generated data
mock.date: "2020-06-14"

# data delivery mode
#mock.type: "http"
#mock.type: "kafka"
mock.type: "log"

# target address in http mode
mock.url: "http://hdp1/applog"

# target address in kafka mode
mock:
  kafka-server: "hdp1:9092,hdp2:9092,hdp3:9092"
  kafka-topic: "ODS_BASE_LOG"

# number of app launches
mock.startup.count: 200
# maximum device id
mock.max.mid: 500000
# maximum member id
mock.max.uid: 100
# maximum sku id
mock.max.sku-id: 35
# average page dwell time, ms
mock.page.during-time-ms: 20000
# error probability, percent
mock.error.rate: 3
# delay between log records, ms
mock.log.sleep: 10
# sources of product-detail visits: user search, product promotion, smart recommendation, sales promotion
mock.detail.source-type-rate: "40:25:15:20"
# probability of claiming a coupon, percent
mock.if_get_coupon_rate: 75
# maximum coupon id
mock.max.coupon-id: 3
# search keywords
mock.search.keyword: "图书,小米,iphone11,电视,口红,ps5,苹果手机,小米盒子"
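Since mock.date controls only the timestamps inside the generated data, you can simulate several days of behavior by changing mock.date and rerunning the generator once per date; the log file name itself still follows the system date (see the logback configuration below).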

(2) path.json, which configures the visit paths

User click paths can be configured flexibly as needed. Each entry's rate is a relative weight (the weights here sum to 100) that determines how often the corresponding path is generated.

[
  {"path":["home","good_list","good_detail","cart","trade","payment"],"rate":20},
  {"path":["home","search","good_list","good_detail","login","good_detail","cart","trade","payment"],"rate":40},
  {"path":["home","mine","orders_unpaid","trade","payment"],"rate":10},
  {"path":["home","mine","orders_unpaid","good_detail","good_spec","comment","trade","payment"],"rate":5},
  {"path":["home","mine","orders_unpaid","good_detail","good_spec","comment","home"],"rate":5},
  {"path":["home","good_detail"],"rate":10},
  {"path":["home"],"rate":10}
]

(3) The logback configuration file

This configures where logs are written. Modify the content as follows:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property name="LOG_HOME" value="/opt/module/applog/log" />

    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
    </appender>

    <appender name="rollingFile" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOG_HOME}/app.%d{yyyy-MM-dd}.log</fileNamePattern>
        </rollingPolicy>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
    </appender>

    <!-- route the logs of one specific package to their own appender -->
    <logger name="com.atgugu.gmall2020.mock.log.util.LogUtil"
            level="INFO" additivity="false">
        <appender-ref ref="rollingFile" />
        <appender-ref ref="console" />
    </logger>

    <root level="error">
        <appender-ref ref="console" />
    </root>
</configuration>
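Two details worth noting in this configuration: additivity="false" keeps the LogUtil logger's output from also propagating to the root logger, so the generated records go only to the rolling file and the console; and because the rolling policy is time-based, a new app.yyyy-MM-dd.log file is started each day, named after the system date.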

3) Generate logs

(1) Change into /opt/module/applog and run the following command

java -jar gmall2020-mock-log-2021-10-10.jar

(2) Check the generated logs in the /opt/module/applog/log directory

ll

Log Collection with Flume

According to the plan, the user behavior log files to be collected live on the hadoop102 log server, so a log-collection Flume agent must be configured on hadoop102. This agent needs to tail the log files, validate the format of each record (JSON), and send the records that pass validation to Kafka.

Here we use TailDirSource and KafkaChannel, together with a log-validation interceptor.

The reasons for choosing TailDirSource and KafkaChannel are as follows:

1) TailDirSource

Advantages of TailDirSource over ExecSource and SpoolingDirectorySource:

TailDirSource supports breakpoint resume and multiple directories. Before Flume 1.6 you had to write a custom Source that recorded the read position in each file to get breakpoint resume; TailDirSource provides this out of the box (see the position-file example after this list).

ExecSource can collect data in real time, but data is lost whenever the Flume agent is not running or the shell command fails.

SpoolingDirectorySource monitors a directory and supports breakpoint resume, but it only processes completed files, so it cannot tail logs that are still being written.

2) KafkaChannel

Using a KafkaChannel removes the need for a Sink stage, which improves efficiency.
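For reference, TailDirSource keeps its read offsets in the positionFile configured below. Its contents are a JSON array of per-file entries; the inode, pos, and file values in this sketch are illustrative:

[{"inode":2496989,"pos":343,"file":"/opt/module/applog/log/app.2020-06-14.log"}]

On restart, the source resumes reading each file from the recorded pos, which is what makes breakpoint resume work.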

The key configuration of the log-collection Flume agent is shown below:

(Figure: key configuration of the log-collection Flume agent)

Hands-On Flume Configuration for Log Collection

Create the Flume configuration file

In the Flume home directory, create a job directory and a file_to_kafka.conf file inside it

mkdir job
vim job/file_to_kafka.conf

The configuration file content is as follows:

#define the components
a1.sources = r1
a1.channels = c1

#configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.atguigu.gmall.flume.interceptor.ETLInterceptor$Builder

#configure the channel
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092
a1.channels.c1.kafka.topic = topic_log
a1.channels.c1.parseAsFlumeEvent = false

#assemble the components
a1.sources.r1.channels = c1
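Note that this agent has no sink: the KafkaChannel itself delivers events to the topic_log topic. Setting parseAsFlumeEvent = false makes the channel write only the raw event body to Kafka, so downstream consumers see the original JSON lines instead of Avro-serialized Flume events.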

Write the Flume Interceptor

(1) Create a Maven project named flume-interceptor

(2) Create the package com.atguigu.gmall.flume.interceptor

(3) Add the following configuration to pom.xml

<dependencies>  
    <dependency>  
        <groupId>org.apache.flume</groupId>  
        <artifactId>flume-ng-core</artifactId>  
        <version>1.9.0</version>  
        <scope>provided</scope>  
    </dependency>  
  
    <dependency>  
        <groupId>com.alibaba</groupId>  
        <artifactId>fastjson</artifactId>  
        <version>1.2.62</version>  
    </dependency>  
</dependencies>  
  
<build>  
    <plugins>  
        <plugin>  
            <artifactId>maven-compiler-plugin</artifactId>  
            <version>2.3.2</version>  
            <configuration>  
                <source>1.8</source>  
                <target>1.8</target>  
            </configuration>  
        </plugin>  
        <plugin>  
            <artifactId>maven-assembly-plugin</artifactId>  
            <configuration>  
                <descriptorRefs>  
                    <descriptorRef>jar-with-dependencies</descriptorRef>  
                </descriptorRefs>  
            </configuration>  
            <executions>  
                <execution>  
                    <id>make-assembly</id>  
                    <phase>package</phase>  
                    <goals>  
                        <goal>single</goal>  
                    </goals>  
                </execution>  
            </executions>  
        </plugin>  
    </plugins>  
</build>
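Two points about this build configuration: flume-ng-core uses provided scope because the Flume runtime already supplies those classes, and the maven-assembly-plugin's jar-with-dependencies descriptor bundles fastjson into the final jar, so the interceptor has everything it needs once dropped into Flume's lib directory.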

(4) Create a JSONUtil class in the com.atguigu.gmall.flume.utils package

package com.atguigu.gmall.flume.utils;

import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.JSONException;

public class JSONUtil {
    /*
     * Use exception handling to decide whether a string is valid JSON:
     * returns true if it is, false otherwise.
     */
    public static boolean isJSONValidate(String log){
        try {
            JSONObject.parseObject(log);
            return true;
        } catch (JSONException e){
            return false;
        }
    }
}
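A quick sanity check of the helper's behavior; this demo class is hypothetical and not part of the project:

package com.atguigu.gmall.flume.utils;

public class JSONUtilDemo {
    public static void main(String[] args) {
        // a well-formed JSON object parses successfully, so validation passes
        System.out.println(JSONUtil.isJSONValidate("{\"page_id\":\"home\"}")); // true
        // a truncated record makes parseObject throw, so validation fails
        System.out.println(JSONUtil.isJSONValidate("{\"page_id\":"));          // false
    }
}

Keep in mind that fastjson's parseObject is lenient with a few inputs (for example, the literal null parses without an exception), so this check verifies parseability rather than enforcing a strict schema.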

(5) Create an ETLInterceptor class in the com.atguigu.gmall.flume.interceptor package


package com.atguigu.gmall.flume.interceptor;

import com.atguigu.gmall.flume.utils.JSONUtil;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

public class ETLInterceptor implements Interceptor {

    @Override
    public void initialize() {
    }

    @Override
    public Event intercept(Event event) {
        // 1. Get the data in the event body and convert it to a string
        byte[] body = event.getBody();
        String log = new String(body, StandardCharsets.UTF_8);
        // 2. If the string is valid JSON, keep the event; otherwise drop it by returning null
        if (JSONUtil.isJSONValidate(log)) {
            return event;
        } else {
            return null;
        }
    }

    @Override
    public List<Event> intercept(List<Event> list) {
        // remove every event whose body is not valid JSON
        Iterator<Event> iterator = list.iterator();
        while (iterator.hasNext()) {
            Event next = iterator.next();
            if (intercept(next) == null) {
                iterator.remove();
            }
        }
        return list;
    }

    @Override
    public void close() {
    }

    public static class Builder implements Interceptor.Builder {

        @Override
        public Interceptor build() {
            return new ETLInterceptor();
        }

        @Override
        public void configure(Context context) {
        }
    }
}
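(6) Package the project. A minimal sketch, assuming a standard Maven setup: with the assembly plugin configured above, the package goal produces a *-jar-with-dependencies.jar under target/ (the exact file name depends on the project's artifactId and version, which are not shown in the pom above).

mvn clean package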

(7) Copy the packaged jar into Flume's lib directory (/opt/module/flume/lib) before starting the agent

Testing the Log Collection Flume Agent

1) Start the Zookeeper and Kafka clusters

2) Start the log-collection Flume agent

bin/flume-ng agent -n a1 -c conf/ -f job/file_to_kafka.conf -Dflume.root.logger=info,console

3) Start a Kafka console consumer

bin/kafka-console-consumer.sh --bootstrap-server hadoop102:9092 --topic topic_log --from-beginning

4) Generate mock data

Run the mock data jar again, as described in the Mock Data section.

5) Check whether the Kafka console consumer receives the data

Study notes based on the 尚硅谷 big data course.