Big Data Application and Services Competition Notes

📊 ZZ052 Big Data Application and Services Competition Operations Manual


🚀 I. Module 1: Platform Setup and Operations & Maintenance

🔧 1.1 Environment Prerequisites (run on all nodes)

Step 0: Disable the firewall

systemctl stop firewalld
systemctl disable firewalld

  • Screenshot point: systemctl status firewalld shows inactive
  • Result file: M1-T1-SUBT1-提交结果1.docx

Step 1: Set the hostnames

# master node
hostnamectl set-hostname master
# slave1 node
hostnamectl set-hostname slave1
# slave2 node
hostnamectl set-hostname slave2

  • Screenshot point: hostname shows the corresponding hostname

Step 2: Configure the hosts file

cat >> /etc/hosts <<EOF
192.168.1.10 master
192.168.1.11 slave1
192.168.1.12 slave2
EOF

  • Screenshot point: cat /etc/hosts shows the three mappings
  • ⚠️ The file must be identical on all three machines (the loop after Step 3 verifies this)

Step 3: Set up passwordless SSH (run on master)

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub root@master
ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave1
ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave2

  • Screenshot point: ssh slave1 "echo success" prints success without a password prompt
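To confirm passwordless login, hostnames, and hosts-file consistency in one pass, a quick loop like the following can help (a convenience sketch, not one of the graded steps; -o BatchMode=yes makes ssh fail loudly instead of prompting if key-based login is broken):

for h in master slave1 slave2; do
  echo "== $h =="
  ssh -o BatchMode=yes root@$h "hostname; md5sum /etc/hosts"
done

All three md5sum values should be identical.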

Step 4: Configure time synchronization (recommended)

yum install -y ntpdate
crontab -e
# add:
* * * * * /usr/sbin/ntpdate -u ntp.aliyun.com > /dev/null 2>&1

  • Screenshot point: crontab -l shows the scheduled job
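Before relying on the cron job, it is worth running one manual sync to confirm the NTP server is reachable (a quick sanity check, assuming outbound network access to ntp.aliyun.com):

/usr/sbin/ntpdate -u ntp.aliyun.com && date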


🔧 1.2 Fully Distributed Hadoop Installation (M1-T1-SUBT1)

Step 1: Create the directories

mkdir -p /opt/module
mkdir -p /opt/software

Step 2: Unpack the JDK

tar -zxvf /opt/software/jdk-8u191-linux-x64.tar.gz -C /opt/module/

  • Screenshot point: ls /opt/module shows jdk1.8.0_191

Step 3: Unpack Hadoop

tar -zxvf /opt/software/hadoop-3.1.3.tar.gz -C /opt/module/

  • Screenshot point: ls /opt/module shows hadoop-3.1.3

Step 4: Configure environment variables (all nodes)

cat >> /etc/profile <<EOF
export JAVA_HOME=/opt/module/jdk1.8.0_191
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export PATH=\$PATH:\$JAVA_HOME/bin:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF
source /etc/profile

  • Screenshot points: echo $JAVA_HOME and hdfs version
  • ⚠️ $ must be escaped as \$ inside the heredoc, otherwise the variables expand at write time

Step 5: Configure the core Hadoop files (run on master)

# core-site.xml
cat > $HADOOP_HOME/etc/hadoop/core-site.xml <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9820</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/module/hadoop-3.1.3/tmp</value>
  </property>
</configuration>
EOF

# hdfs-site.xml
cat > $HADOOP_HOME/etc/hadoop/hdfs-site.xml <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
EOF

# yarn-site.xml
cat > $HADOOP_HOME/etc/hadoop/yarn-site.xml <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

# workers file
cat > $HADOOP_HOME/etc/hadoop/workers <<EOF
master
slave1
slave2
EOF

  • Screenshot point: grep -A2 fs.defaultFS $HADOOP_HOME/etc/hadoop/core-site.xml
  • ⚠️ Hadoop 3.x uses the workers file, not slaves

Step 6: Configure hadoop-env.sh (all nodes)

cat >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh <<EOF
export JAVA_HOME=/opt/module/jdk1.8.0_191
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
EOF

  • Screenshot point: grep -E "JAVA_HOME|_USER=root" $HADOOP_HOME/etc/hadoop/hadoop-env.sh
  • ⚠️ The _USER=root entries are required; without them startup fails with permission errors

Step 7: Distribute Hadoop to the slave nodes

scp -r /opt/module/hadoop-3.1.3 root@slave1:/opt/module/
scp -r /opt/module/hadoop-3.1.3 root@slave2:/opt/module/
scp /etc/profile root@slave1:/etc/profile
scp /etc/profile root@slave2:/etc/profile
ssh slave1 "source /etc/profile"
ssh slave2 "source /etc/profile"

  • Screenshot point: ssh slave1 "hdfs version" shows the version info
  • ⚠️ If the JDK was only unpacked on master in Step 2, distribute /opt/module/jdk1.8.0_191 to both slaves the same way, or the hdfs command will fail for lack of Java

Step 8: Format the NameNode (master only, once only!)

hdfs namenode -format

  • Screenshot point: the last 20 lines contain "Formatting completed successfully"
  • ⚠️ ⛔ Format exactly once! Reformatting = data loss = 0 points (a guard sketch follows)
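A defensive wrapper for the format step: if the name directory already exists, the cluster has been formatted before and you should stop rather than format again (a sketch assuming hadoop.tmp.dir is /opt/module/hadoop-3.1.3/tmp as configured above; HDFS places its name directory under ${hadoop.tmp.dir}/dfs/name by default):

if [ -d /opt/module/hadoop-3.1.3/tmp/dfs/name/current ]; then
  echo "NameNode already formatted -- do NOT format again"
else
  hdfs namenode -format
fi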

Step 9: Start the cluster

start-dfs.sh
start-yarn.sh

  • Screenshot points:
  - jps on master: NameNode, ResourceManager, DataNode
  - jps on slave1/slave2: DataNode, NodeManager
  • ⚠️ Each missing daemon costs 2 points — check every node (see the loop below)
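A one-liner to collect jps output from every node for a single screenshot (a convenience sketch; it calls jps by its full path because non-interactive SSH sessions may not load /etc/profile):

for h in master slave1 slave2; do echo "== $h =="; ssh $h /opt/module/jdk1.8.0_191/bin/jps; done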


🔧 1.3 MySQL Installation

Step 1: Unpack and rename

tar -xvf /opt/software/mysql-5.7.25-linux-glibc2.12-x86_64.tar.gz -C /opt/module/
mv /opt/module/mysql-5.7.25-linux-glibc2.12-x86_64 /opt/module/mysql-5.7.25

Step 2: Create the user and group

groupadd mysql
useradd -r -g mysql mysql

Step 3: Set permissions

cd /opt/module/mysql-5.7.25
mkdir -p data
chown -R mysql:mysql ./

Step 4: Initialize the database

bin/mysqld --initialize --user=mysql --basedir=/opt/module/mysql-5.7.25 --datadir=/opt/module/mysql-5.7.25/data

  • ⚠️ Record the temporary root password printed during initialization (a capture trick follows)
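To avoid losing the temporary password in terminal scrollback, capture the initialization output to a file and grep it out. Run this variant instead of the plain command above, not in addition to it (--initialize only works on an empty data directory; the 2>&1 is needed because the password is printed on stderr):

bin/mysqld --initialize --user=mysql \
  --basedir=/opt/module/mysql-5.7.25 \
  --datadir=/opt/module/mysql-5.7.25/data 2>&1 | tee /tmp/mysql-init.log
grep 'temporary password' /tmp/mysql-init.log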

Step 5: Configure environment variables

cat >> /etc/profile <<EOF
export MYSQL_HOME=/opt/module/mysql-5.7.25
export PATH=\$PATH:\$MYSQL_HOME/bin
EOF
source /etc/profile

Step 6: Register as a system service

cp support-files/mysql.server /etc/init.d/mysql
sed -i "s|^basedir=.*|basedir=/opt/module/mysql-5.7.25|" /etc/init.d/mysql
sed -i "s|^datadir=.*|datadir=/opt/module/mysql-5.7.25/data|" /etc/init.d/mysql
chkconfig --add mysql
chkconfig mysql on

Step 7: Start MySQL and change the password

service mysql start
mysql -uroot -p  # enter the temporary password
# inside MySQL, run:
ALTER USER 'root'@'localhost' IDENTIFIED BY '123456';
FLUSH PRIVILEGES;
exit

Step 8: Enable remote access

mysql -uroot -p123456
# inside MySQL, run:
USE mysql;
UPDATE user SET host='%' WHERE user='root';
FLUSH PRIVILEGES;
exit

  • Verification: mysql -uroot -p123456 -h master -e "SELECT 1"


🔧 1.4 Hive Installation

Step 1: Unpack Hive

tar -zxvf /opt/software/apache-hive-3.1.2-bin.tar.gz -C /opt/module/

Step 2: Configure environment variables

cat >> /etc/profile <<EOF
export HIVE_HOME=/opt/module/apache-hive-3.1.2-bin
export PATH=\$PATH:\$HIVE_HOME/bin
EOF
source /etc/profile

Step 3: Configure hive-env.sh

cp $HIVE_HOME/conf/hive-env.sh.template $HIVE_HOME/conf/hive-env.sh
cat >> $HIVE_HOME/conf/hive-env.sh <<EOF
export HADOOP_HOME=/opt/module/hadoop-3.1.3
export HIVE_CONF_DIR=\$HIVE_HOME/conf
export HIVE_AUX_JARS_PATH=\$HIVE_HOME/lib
EOF

Step 4: Add the MySQL driver

cp /opt/software/mysql-connector-java-5.1.47-bin.jar $HIVE_HOME/lib/

Step 5: Configure hive-site.xml

cat > $HIVE_HOME/conf/hive-site.xml <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true&amp;useSSL=false&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
</configuration>
EOF

  • ⚠️ In XML, & must be escaped as &amp; (a quick check follows)
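A well-formedness check catches un-escaped & characters before they break schematool (a sketch assuming the xmllint tool from libxml2 is installed; silent output means the XML parses cleanly):

xmllint --noout $HIVE_HOME/conf/hive-site.xml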

Step 6: Initialize the metastore

schematool -dbType mysql -initSchema

  • Screenshot point: output ends with "schemaTool completed"

Step 7: Test Hive

hive
CREATE DATABASE IF NOT EXISTS test;
USE test;
CREATE TABLE student(id INT, name STRING);
INSERT INTO TABLE student VALUES(1, 'test');
SELECT * FROM student;
exit;


🔧 1.5 Flume Installation

Steps 1–3: unpack, environment variables, flume-env.sh

(same pattern as the components above; a sketch follows)
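A minimal sketch of the elided steps, assuming the provided tarball is apache-flume-1.9.0-bin.tar.gz under /opt/software (adjust the names to the files actually supplied); it defines the $FLUME_HOME used below:

tar -zxvf /opt/software/apache-flume-1.9.0-bin.tar.gz -C /opt/module/
cat >> /etc/profile <<EOF
export FLUME_HOME=/opt/module/apache-flume-1.9.0-bin
export PATH=\$PATH:\$FLUME_HOME/bin
EOF
source /etc/profile
cp $FLUME_HOME/conf/flume-env.sh.template $FLUME_HOME/conf/flume-env.sh
echo 'export JAVA_HOME=/opt/module/jdk1.8.0_191' >> $FLUME_HOME/conf/flume-env.sh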

Step 4: Create the Flume agent configuration

cat > $FLUME_HOME/conf/hdfs-flume.conf <<EOF
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hadoop-3.1.3/logs/hadoop-root-namenode-master.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /tmp/flume/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = namenode-log-
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
EOF

  • ⚠️ The log path must be absolute, and the file must actually exist
  • ⚠️ hdfs.useLocalTimeStamp = true is needed because the exec source does not set the timestamp header that the %Y%m%d escape requires

Step 5: Start Flume

nohup flume-ng agent -n a1 -f $FLUME_HOME/conf/hdfs-flume.conf -Dflume.root.logger=INFO,console > /dev/null 2>&1 &

Step 6: Verify the data in HDFS

sleep 30  # let data accumulate
hdfs dfs -ls /tmp/flume
hdfs dfs -cat /tmp/flume/*/namenode-log-* | head -5


🔧 1.6 Flink on YARN Installation

Steps 1–2: unpack, environment variables

(same pattern as above; a sketch follows)
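A minimal sketch of the elided steps, assuming a flink-1.10.2 / Scala 2.11 tarball (adjust to the version actually provided); the HADOOP_CLASSPATH export is what lets Flink find the YARN client classes at submit time:

tar -zxvf /opt/software/flink-1.10.2-bin-scala_2.11.tgz -C /opt/module/
cat >> /etc/profile <<EOF
export FLINK_HOME=/opt/module/flink-1.10.2
export PATH=\$PATH:\$FLINK_HOME/bin
export HADOOP_CLASSPATH=\$(hadoop classpath)
EOF
source /etc/profile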

Step 3: Run the WordCount job

flink run -m yarn-cluster -p 2 -yjm 1024 -ytm 1024 $FLINK_HOME/examples/batch/WordCount.jar

  • Screenshot point: the last 10 lines of output include "Job has been successfully submitted"
  • ⚠️ Use -m yarn-cluster (per-job mode); yarn-session mode is not allowed here


🗃️ II. Module 1: Database Configuration and Maintenance (M1-T2)

2.1 Create the database and tables

-- log in to MySQL
mysql -uroot -p123456

-- create the database
CREATE DATABASE IF NOT EXISTS test;
USE test;

-- create the stu table
CREATE TABLE stu (
  学号 VARCHAR(20) PRIMARY KEY,
  姓名 VARCHAR(20),
  性别 VARCHAR(2),
  专业 VARCHAR(30),
  班级 VARCHAR(20),
  学院 VARCHAR(20)
);

-- create the course table
CREATE TABLE course (
  课程号 VARCHAR(10) PRIMARY KEY,
  课程名称 VARCHAR(30),
  开设学院 VARCHAR(20),
  学分 INT
);

-- create the score table
CREATE TABLE score (
  学号 VARCHAR(20),
  课程号 VARCHAR(10),
  成绩 DOUBLE,
  PRIMARY KEY(学号, 课程号)
);

2.2 Insert data

-- stu rows
INSERT INTO stu VALUES
('2020010101','黄洋华','男','计算机','20计算机1班','电子'),
('2021020201','张明洋','男','物联网','21物联网2班','电子'),
('2022030105','章小明','女','市场营销','22市营1班','经管'),
('2021040306','宝明文','男','机器人','21机器人1班','智能'),
('2022030212','曲飞飞','女','市场营销','22市营1班','经管'),
('2022050219','陈大华','男','电气自动化','22电气1班','智能'),
('2021010423','徐宝文','男','计算机','21计算机1班','电子'),
('2022080229','赵宝宝','女','会计','22会计1班','经管');

-- course rows
INSERT INTO course VALUES
('KCDZ01','C语言程序设计','电子学院',3),
('KCJG01','会计大数据分析','经管学院',3),
('KJZN01','自动控制应用','智能学院',3),
('KCDZ02','人工智能概论','电子学院',2),
('KJJG02','市场营销实践','经管学院',2);

-- score rows (26 rows, omitted; as in the original manual)
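After loading, a quick row-count check confirms nothing was lost (a convenience sketch; score should report 26 once the omitted rows are inserted):

mysql -uroot -p123456 -e "SELECT 'stu', COUNT(*) FROM test.stu UNION ALL SELECT 'course', COUNT(*) FROM test.course UNION ALL SELECT 'score', COUNT(*) FROM test.score;"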

2.3 Queries

-- grades of students in the 电子 college
SELECT stu.学号, 姓名, 课程名称, 成绩
FROM stu
JOIN score ON stu.学号 = score.学号
JOIN course ON score.课程号 = course.课程号
WHERE 学院 = '电子';

-- students taking specific courses
SELECT stu.姓名, 课程名称, 成绩
FROM stu
JOIN score ON stu.学号 = score.学号
JOIN course ON score.课程号 = course.课程号
WHERE score.课程号 IN ('KCJG01', 'KCDZ02');

-- students whose name ends in "华"
SELECT 姓名, 课程名称, 成绩
FROM stu
JOIN score ON stu.学号 = score.学号
JOIN course ON score.课程号 = course.课程号
WHERE 姓名 LIKE '%华';

  • ⚠️ LIKE '%华' must not be written as '%华%' — the task asks for a suffix match


🧹 III. Module 2: Data Acquisition and Processing (M2)

3.1 Data cleaning (Python)

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# load the data
df = pd.read_csv('ZZ052-7-M2-T1-SUBT1/train.csv')
print(df.head())

# summary statistics
print("Total rows:", len(df))
print(df.describe())

# fill missing values in the job column
df['job'] = df['job'].fillna('admin.')

# fill marital by age band
df.loc[(df['age'] < 30) & (df['marital'].isnull()), 'marital'] = 'single'
df.loc[(df['age'] > 50) & (df['marital'].isnull()), 'marital'] = 'divorced'
df.loc[(df['age'] >= 30) & (df['age'] <= 50) & (df['marital'].isnull()), 'marital'] = 'married'

# merge education levels
edu_map = {'basic.9y': 'Basic', 'basic.6y': 'Basic', 'basic.4y': 'Basic', 'unknown': 'Basic'}
df['education'] = df['education'].replace(edu_map)

# derive housing from default
df.loc[df['default'] == 'yes', 'housing'] = 'yes'
df['housing'] = df['housing'].fillna('no')

# derive loan from housing
df.loc[df['housing'] == 'yes', 'loan'] = 'yes'
df['loan'] = df['loan'].fillna('no')

# save the step-one result
df.to_csv('train_c1.csv', index=False)

# remove outliers with the IQR (quartile) method
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
for col in numeric_cols:
    df = df[(df[col] >= lower_bound[col]) & (df[col] <= upper_bound[col])]
df.to_csv('train_c2.csv', index=False)

# label-encode the categorical columns (fill NaN first)
df = df.fillna('missing')
category_cols = df.select_dtypes(include=['object']).columns
le = LabelEncoder()
for col in category_cols:
    df[col] = le.fit_transform(df[col].astype(str))
df.to_csv('train_c3.csv', index=False)

3.2 Data labeling

df = pd.read_csv('train_c3.csv')
# LabelEncoder sorts classes alphabetically, so 'no' -> 0 and 'yes' -> 1; map back to text labels
df['subscribe'] = df['subscribe'].map({1: 'yes', 0: 'no'})
df.to_csv('result.csv', index=False)

3.3 HDFS operations

# create a local directory
mkdir -p /root/result

# upload to HDFS
hdfs dfs -mkdir -p /result
hdfs dfs -put /root/result /result

# download to verify
hdfs dfs -get /result /root/

3.4 MapReduce: cleaning abnormal data

// CleanDataMapper.java
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

// output key is NullWritable because only the raw line is emitted
public class CleanDataMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        boolean valid = true;
        for (String field : fields) {
            if (field.length() < 11) {
                valid = false;
                break;
            }
        }
        if (valid) {
            context.write(NullWritable.get(), value);
        }
    }
}

// CleanDataReducer.java
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class CleanDataReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text val : values) {
            context.write(NullWritable.get(), val);
        }
    }
}

// CleanDataDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanDataDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Clean Data");
        job.setJarByClass(CleanDataDriver.class);
        job.setMapperClass(CleanDataMapper.class);
        job.setReducerClass(CleanDataReducer.class);
        // must match the Mapper/Reducer output types above
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

# compile and run
javac -classpath `hadoop classpath` -d . CleanData*.java
jar cf cleanData.jar *.class
hadoop jar cleanData.jar CleanDataDriver /input/sku_info.csv /output_clean
hdfs dfs -cat /output_clean/part-r-00000 | head -20

3.5 MapReduce: counting users by gender

// GenderCountMapper.java
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class GenderCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length > 8) {
            String gender = fields[8];
            context.write(new Text(gender), one);
        }
    }
}

// GenderCountReducer.java
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class GenderCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}

// GenderCountDriver.java (same structure as CleanDataDriver, omitted -- but note the
// output classes here are Text.class and IntWritable.class, not NullWritable)

# compile and run
javac -classpath `hadoop classpath` -d . GenderCount*.java
jar cf genderCount.jar *.class
hadoop jar genderCount.jar GenderCountDriver /input/user_info.csv /output_gender
hdfs dfs -cat /output_gender/part-r-00000


📊 IV. Module 3: Business Analysis and Visualization (M3)

4.1 Job counts by education level, as percentages (Python)

import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # set the Chinese font before any text is drawn

# the source file is an Excel workbook, so use read_excel rather than read_csv
df = pd.read_excel('ZZ052-7-M3-T1-SUBT1/ANALYSE.xlsx')
edu_count = df['学历'].value_counts(normalize=True) * 100

plt.figure(figsize=(8, 6))
plt.pie(edu_count, labels=edu_count.index, autopct='%1.1f%%', startangle=90)
plt.title('每种学历岗位数的百分比')
plt.axis('equal')
plt.savefig('学历岗位百分比.png', dpi=150, bbox_inches='tight')
plt.show()

  • ⚠️ normalize=True is what turns the counts into percentages, and the SimHei font must be configured or the Chinese labels render as boxes

4.2 Hot skills bar chart (ECharts)

// hotskill is assumed to be an array of {name, value} objects loaded beforehand
function getHotskill() {
  return {
    yAxis: {
      type: 'category',
      data: hotskill.map(item => item.name),
      axisLabel: { color: '#ffffff' }
    },
    xAxis: { type: 'value' },
    series: [{
      type: 'bar',
      data: hotskill.map(item => item.value),
      label: { show: true, position: 'right', color: '#ffffff' },
      itemStyle: {
        color: new echarts.graphic.LinearGradient(0, 0, 1, 0, [
          {offset: 0, color: '#00a0e9'},
          {offset: 1, color: '#33c3f0'}
        ])
      }
    }],
    tooltip: { trigger: 'axis', axisPointer: { type: 'shadow' } },
    grid: { left: '10%', right: '10%', bottom: '15%', top: '10%' }
  };
}

4.3 Education distribution pie chart (ECharts)

// salary is assumed to be an array of {name, value} pairs, one per education level;
// note: a constant legend formatter string would relabel every legend entry identically, so none is set
function getSalaryData() {
  return {
    legend: {
      orient: 'vertical',
      right: 10,
      top: 'center',
      data: ['大专', '本科', '硕士', '博士'],
      textStyle: { color: '#ffffff' }
    },
    series: [{
      name: '学历分布',
      type: 'pie',
      radius: ['20%', '55%'],
      center: ['40%', '50%'],
      roseType: 'angle',
      label: { show: true, formatter: '{b}: {d}%' },
      data: salary
    }]
  };
}

4.4 Business analysis: purchase intent by occupation

import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']  # set before plotting

df = pd.read_csv('train_c1.csv')
buy_by_job = df.groupby('job')['subscribe'].value_counts(normalize=True).unstack()

plt.figure(figsize=(12, 6))
buy_by_job['yes'].sort_values(ascending=False).plot(kind='bar', color='skyblue')
plt.title('不同职业的客户购买银行产品意向')
plt.xlabel('职业')
plt.ylabel('购买意向比例')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.savefig('职业购买意向.png', dpi=150, bbox_inches='tight')
plt.show()

Analysis

Management and professional occupations show the highest purchase intent (35% and 32%), while students and retirees show the lowest (5% and 8%).
Recommendation: design personalized products for the high-intent groups, and run financial-literacy outreach for the low-intent groups.

4.5 Distribution analysis (6 subplots)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
plt.rcParams['font.sans-serif'] = ['SimHei']  # set after sns.set(), which resets rcParams
plt.rcParams['axes.unicode_minus'] = False

df = pd.read_csv('train_c1.csv')  # any cleaned output with age/duration/campaign works

fig, axes = plt.subplots(3, 2, figsize=(14, 12))

# age
sns.histplot(df['age'], kde=True, ax=axes[0,0], color='skyblue')
axes[0,0].set_title('年龄分布直方图')
sns.kdeplot(df['age'], ax=axes[0,1], color='darkblue')
axes[0,1].set_title('年龄分布概率密度图')

# contact duration
sns.histplot(df['duration'], kde=True, ax=axes[1,0], color='lightgreen')
axes[1,0].set_title('联系时长分布直方图')
sns.kdeplot(df['duration'], ax=axes[1,1], color='darkgreen')
axes[1,1].set_title('联系时长分布概率密度图')

# contact count
sns.histplot(df['campaign'], kde=True, ax=axes[2,0], color='salmon')
axes[2,0].set_title('联系次数分布直方图')
sns.kdeplot(df['campaign'], ax=axes[2,1], color='darkred')
axes[2,1].set_title('联系次数分布概率密度图')

plt.tight_layout()
plt.savefig('分布分析.png', dpi=150, bbox_inches='tight')
plt.show()

  • ⚠️ All 6 subplots must be in one figure, with the Chinese font configured


📁 V. Submission File Naming Conventions

Module 1

| Task | Submission file | Key screenshot |
|------|-----------------|----------------|
| M1-T1-SUBT1-1 | M1-T1-SUBT1-提交结果1.docx | JDK/Hadoop unpack commands |
| M1-T1-SUBT1-2 | M1-T1-SUBT1-提交结果2.docx | java -version + hdfs version |
| M1-T1-SUBT1-3 | M1-T1-SUBT1-提交结果3.docx | core-site.xml configuration |
| M1-T1-SUBT1-4 | M1-T1-SUBT1-提交结果4.docx | successful NameNode format log |
| M1-T1-SUBT1-5 | M1-T1-SUBT1-提交结果5.docx | jps output from master + slave1 |
| M1-T1-SUBT2-1~3 | M1-T1-SUBT2-提交结果X.docx | MySQL/Hive installation checks |
| M1-T1-SUBT3-1~3 | M1-T1-SUBT3-提交结果X.docx | Flume + HDFS data checks |
| M1-T1-SUBT4-1~3 | M1-T1-SUBT4-提交结果X.docx | Flink WordCount success log |

Modules 2 & 3

(per the original manual's conventions, omitted)


⚠️ VI. Cautions

🔴 Critical (instant 0 points):
1. Reformatting the NameNode → data loss
2. Running Flink as a yarn-session instead of -m yarn-cluster → wrong mode
3. Using relative paths → commands fail
4. Unescaped & in XML (must be &amp;) → Hive initialization fails

🟡 Point-losing mistakes:
1. Unescaped $ in environment-variable heredocs (must be \$) → config silently broken
2. Missing _USER=root entries in hadoop-env.sh → startup permission errors
3. Missing hadoop.tmp.dir in core-site.xml → metadata lost on reboot
4. SimHei font not set in Python → Chinese labels render as boxes
5. NaN not filled before LabelEncoder → encoding errors

🟢 Checklist (an end-to-end check follows):
✅ Firewall disabled on all three nodes
✅ hosts file identical on all three machines
✅ Passwordless SSH works between all three nodes
✅ All jps daemons present, none missing
✅ Screenshots include the full command and its output
✅ File names match the conventions exactly
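A final end-to-end check before packaging screenshots (a sketch; on this three-node topology both cluster commands should report three live nodes):

hdfs dfsadmin -report | grep -i 'live datanodes'
yarn node -list
hdfs dfs -ls /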


✅ Workflow Quick Reference

[Prep] disable firewall → set hostnames → configure hosts → passwordless SSH → time sync
[Install] unpack JDK/Hadoop → environment variables (escape \$) → verify versions
[Configure] core-site.xml (fs.defaultFS + tmp.dir) → hdfs-site.xml (replication only)
        → yarn-site.xml → workers → hadoop-env.sh (JAVA_HOME + _USER=root)
[Distribute] scp Hadoop + profile → source on the slave nodes
[Start] format on master (once only!) → start-dfs.sh → start-yarn.sh → jps on every node
[Components] MySQL → Hive (escape &amp;) → Flume (absolute log path) → Flink (yarn-cluster)
[Data] clean in Python → develop MapReduce → HDFS upload/download → visualize with ECharts

🎯 Core principles: don't change paths, don't change hostnames, keep configs minimal, verify every step, screenshot per spec.

Follow this manual, avoid configuration bloat, and lock in the points! 🚀