Running Flink on a Single Machine: From Local Mode to a Stream Processing Demo

I. Installing Flink in Local Mode
1.1 Environment Preparation
flowchart TD
    A[Operating system] --> B{Linux/macOS/Windows}
    B -->|Linux| C[CentOS/Ubuntu]
    B -->|Windows| D[WSL2 recommended]
    C --> E[Install JDK 8/11]
    D --> E
System requirements:
- JDK 8 or 11 (JDK 11 recommended)
- Available memory ≥ 2 GB
- Disk space ≥ 500 MB
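The JDK requirement above can be checked with a few lines of Python. This is a minimal sketch, not part of the Flink tooling: the helper names `parse_java_major` and `is_supported_jdk` are made up for illustration, and the supported set `(8, 11)` simply mirrors the list above.

```python
import re


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_372).
    if major == 1 and match.group(2) is not None:
        major = int(match.group(2))
    return major


def is_supported_jdk(version_output: str, supported=(8, 11)) -> bool:
    """Return True if the reported JDK is one of the versions listed above."""
    return parse_java_major(version_output) in supported


print(is_supported_jdk('openjdk version "11.0.20" 2023-07-18'))
```

Note that `java -version` writes to standard error, so when capturing it with `subprocess.run(["java", "-version"], capture_output=True, text=True)` the text to parse is in `.stderr`.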
1.2 Installation Steps
wget https://archive.apache.org/dist/flink/flink-1.17.2/flink-1.17.2-bin-scala_2.12.tgz
sudo tar -xzf flink-1.17.2-bin-scala_2.12.tgz -C /opt
sudo ln -s /opt/flink-1.17.2 /opt/flink
echo 'export FLINK_HOME=/opt/flink' >> ~/.bashrc
echo 'export PATH=$FLINK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
1.3 Verifying the Installation
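import subprocess


def check_flink():
    """Run `flink --version` and report whether the CLI is on PATH."""
    try:
        result = subprocess.run(
            ["flink", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError) as e:
        print(f"Check failed: {e}")
        return
    # The CLI prints a line such as "Version: 1.17.2, Commit ID: ..."
    if "Version" in result.stdout:
        print("✅ Flink installation verified")
        print(result.stdout.splitlines()[0])
    else:
        print("❌ Flink is not installed correctly")


check_flink()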
II. Real-Time Stream Processing Demo: Socket Word Count
2.1 Data Processing Flow
sequenceDiagram
    participant Socket as Socket server
    participant Source as Flink Source
    participant Process as Processing logic
    participant Sink as Output Sink
    Socket->>Source: Send text data
    Source->>Process: Split into words
    Process->>Process: Count word frequency
    Process->>Sink: Print results
2.2 Python Implementation
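from pyflink.common.time import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import FlatMapFunction, RuntimeContext
from pyflink.datastream.window import TumblingProcessingTimeWindows


class WordCounter(FlatMapFunction):
    """Optional non-windowed variant that emits a running count per word.

    The counts live in a plain dict rather than in Flink state, so they are
    lost on restart. socket_demo() below does not use this class.
    """

    def __init__(self):
        self.word_counts = None

    def open(self, runtime_context: RuntimeContext):
        self.word_counts = {}

    def flat_map(self, value):
        # value is one line of text (a str), so split it directly
        for word in value.lower().split():
            self.word_counts[word] = self.word_counts.get(word, 0) + 1
            yield (word, self.word_counts[word])


def socket_demo():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)
    # The pip-installed apache-flink package already bundles the Python
    # runtime jar, so no add_jars() call is needed for this job.

    # Reads newline-delimited text from the Netcat server. Check that your
    # PyFlink release provides socket_text_stream(); the Python DataStream
    # API may not expose it in every version.
    socket_stream = env.socket_text_stream("localhost", 9999)
    result = (
        socket_stream
        .flat_map(lambda line: [(word, 1) for word in line.split()],
                  output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
        .key_by(lambda x: x[0])
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        # Windowed streams in PyFlink are aggregated with reduce(), not sum()
        .reduce(lambda a, b: (a[0], a[1] + b[1]))
    )
    result.print()
    env.execute("Socket WordCount")


if __name__ == '__main__':
    socket_demo()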
2.3 Running the Demo
- Start the Netcat server:
    nc -lk 9999
- In another terminal, start the Flink job:
    python socket_wordcount.py
- Type test data into the Netcat terminal:
    Hello Flink
    Hello Python
    Flink Streaming Demo
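On systems without Netcat, a small Python script can play the role of the socket server. The following is only a sketch under stated assumptions: `open_server`, `send_lines`, and the `linger` parameter are hypothetical helpers, not part of Flink, and the script serves exactly one client before closing.

```python
import socket
import time


def open_server(host="127.0.0.1", port=9999):
    """Create a listening TCP socket on the port the Flink job connects to."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines, linger=0.0):
    """Wait for one client and send it each line, newline-terminated.

    `linger` keeps the connection open for that many seconds after the last
    line, so that a downstream window has time to fire before the stream ends.
    """
    conn, _ = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
        time.sleep(linger)
```

A call such as `send_lines(open_server(), ["Hello Flink", "Hello Python", "Flink Streaming Demo"], linger=10)` would replay the test data above. Unlike `nc -lk`, the script is not interactive and closes the connection when it finishes, so Netcat remains the better choice for typing data by hand.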
2.4 How Window Computation Works
\[ \mathrm{WindowCount}(t) = \sum_{i=t-w}^{t} \mathrm{event}_i \]
where:
- \( w \) = window size (5 seconds in the example)
- \( t \) = current processing time
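The sum above can be illustrated without Flink by bucketing timestamped events in plain Python. This is a simplified sketch: the function name `tumbling_window_counts` and the sample events are made up, timestamps are integer milliseconds, and each window is taken as start-inclusive and end-exclusive, which is how Flink aligns tumbling windows.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_size):
    """Group (timestamp, word) events into tumbling windows and count words.

    Returns {window_start: Counter}; each window covers
    [window_start, window_start + window_size).
    """
    windows = defaultdict(Counter)
    for timestamp, word in events:
        # Round the timestamp down to the start of its window
        window_start = (timestamp // window_size) * window_size
        windows[window_start][word] += 1
    return dict(windows)


# Hypothetical events: (processing time in ms, word)
events = [(500, "hello"), (1200, "flink"), (4900, "hello"),
          (5100, "hello"), (9000, "python")]
for start, counts in sorted(tumbling_window_counts(events, 5000).items()):
    print(start, dict(counts))
```

Each event lands in exactly one window, so "hello" is counted twice in the window starting at 0 and once in the window starting at 5000, rather than three times overall.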
III. Understanding the Flink Web Dashboard
3.1 Dashboard Layout
graph TD
    A[Dashboard] --> B[JobManager]
    A --> C[TaskManager]
    A --> D[Running jobs]
    A --> E[Completed jobs]
    B --> F[Memory usage]
    B --> G[Task slot status]
    C --> H[CPU load]
    C --> I[Network throughput]
3.2 Key Metrics
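| Area | Key metric | Healthy range |
|---|---|---|
| JobManager | Heap Used | < 80% of maximum heap memory |
| TaskManager | Number of TaskSlots | Available slots > 0 |
| Job overview | Records Sent/Received | Grows steadily without stalls |
| Checkpoints | Last Checkpoint Duration | < 200% of the average duration |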
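The thresholds in the table can be written down as a small check. The sketch below assumes the metric values have already been read off the dashboard into a plain dict; the key names (`heap_used`, `heap_max`, and so on) are hypothetical and do not correspond to any Flink API.

```python
def check_health(metrics):
    """Return a list of warnings for values outside the healthy ranges above.

    `metrics` is a plain dict with hypothetical keys, not a Flink object.
    """
    warnings = []
    if metrics["heap_used"] >= 0.8 * metrics["heap_max"]:
        warnings.append("JobManager heap usage is at or above 80%")
    if metrics["available_slots"] <= 0:
        warnings.append("No free task slots")
    if metrics["last_checkpoint_ms"] >= 2 * metrics["avg_checkpoint_ms"]:
        warnings.append("Last checkpoint took at least twice the average")
    return warnings


sample = {
    "heap_used": 300, "heap_max": 1024,       # MB
    "available_slots": 1,
    "last_checkpoint_ms": 120, "avg_checkpoint_ms": 100,
}
print(check_health(sample))
```

The "Records Sent/Received" row is left out because judging steady growth needs at least two samples taken over time rather than a single snapshot.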
3.3 Reading the Job Execution Graph
flowchart LR
    Source[Socket Source] -->|Raw data| Map[Split into words]
    Map -->|"(word, 1)"| KeyBy[Group by word]
    KeyBy --> Window[5-second tumbling window]
    Window --> Sum[Sum counts]
    Sum --> Print[Print results]
IV. Advanced Local Mode Configuration
4.1 Memory Tuning Parameters
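from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment

# PyFlink takes settings through a Configuration object passed to
# get_execution_environment(); values are given as strings.
config = Configuration()
config.set_string("taskmanager.memory.process.size", "2048m")
config.set_string("taskmanager.numberOfTaskSlots", "2")

env = StreamExecutionEnvironment.get_execution_environment(config)
env.set_parallelism(2)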
4.2 State Backend Configuration
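from pyflink.datastream import FsStateBackend

# env is the StreamExecutionEnvironment created in 4.1.
# FsStateBackend still works in Flink 1.17 but has been deprecated since 1.13.
state_backend = FsStateBackend("file:///tmp/checkpoints")
env.set_state_backend(state_backend)
env.enable_checkpointing(5000)  # checkpoint every 5000 ms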
V. Troubleshooting
5.1 Fault Diagnosis Table
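| Symptom | Likely cause | Fix |
|---|---|---|
| Cannot connect to the socket | Port blocked or server not started | Check firewall settings and confirm that the nc command is running |
| No output | The time window has not fired | Send data and wait at least one full window period (5 seconds in the demo); a window that receives no data emits nothing |
| State is lost | Checkpointing is misconfigured | Enable checkpointing and verify the storage path |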
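For the first row of the table, a quick way to tell "server not started" apart from a problem inside the job is to probe the port directly. A minimal sketch using only the standard library; `port_is_open` is a hypothetical helper name.

```python
import socket


def port_is_open(host="localhost", port=9999, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


print(port_is_open("127.0.0.1", 9999))
```

The probe opens and immediately closes a real connection. That is harmless with `nc -lk`, which keeps listening, but a Netcat started without `-k` would treat the probe as its one client and exit.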
5.2 Locating Log Files
# JobManager log (written by the standalone session process)
tail -f $FLINK_HOME/log/flink-*-standalonesession-*.log
# TaskManager log
tail -f $FLINK_HOME/log/flink-*-taskexecutor-*.log
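When a log file is long, filtering it for warnings and errors is often faster than following it live. The sketch below assumes the log level appears as a space-delimited token on each line; `find_problem_lines` is a hypothetical helper, not a Flink utility.

```python
from pathlib import Path


def find_problem_lines(log_path, levels=("ERROR", "WARN")):
    """Return (line_number, text) for log lines whose level is in `levels`."""
    hits = []
    with Path(log_path).open(encoding="utf-8", errors="replace") as log:
        for number, line in enumerate(log, start=1):
            # Match the level as a whole token, e.g. "... ERROR org.apache..."
            if any(f" {level} " in line for level in levels):
                hits.append((number, line.rstrip("\n")))
    return hits
```

Calling `find_problem_lines(path)` on one of the files above returns the matching lines with their line numbers, which makes it easy to jump to the surrounding stack trace in an editor.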
VI. Performance Optimization Tips
6.1 Local Mode Tuning Matrix
\[ \text{Maximum throughput} = \frac{\text{processing capacity}}{\text{window size}} \times \text{parallelism} \]
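As a worked example of the formula, taken at face value: the numbers below are illustrative assumptions, with processing capacity read as the number of records one parallel subtask can handle per window.

```python
def max_throughput(processing_capacity, window_size_s, parallelism):
    """Evaluate the rule-of-thumb formula above; returns records per second."""
    if window_size_s <= 0 or parallelism <= 0:
        raise ValueError("window size and parallelism must be positive")
    return processing_capacity / window_size_s * parallelism


# Assumed inputs: 50,000 records per window, 5-second window, parallelism 2
print(max_throughput(50_000, 5, 2))  # 20000.0
```

This is a rough estimate only; real throughput also depends on factors the formula does not model, such as serialization cost and key skew.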
6.2 Configuration Parameter Tuning
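| Parameter | Recommended value | Description |
|---|---|---|
| taskmanager.memory.task.heap.size | 1024m | JVM heap reserved for tasks in each TaskManager |
| parallelism.default | Number of CPU cores | Default parallelism |
| task.cancellation.timeout | 30000 | Task cancellation timeout (ms) |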
Further practice: change the window type to a sliding window and observe how the counts change. The complete code examples are available in the GitHub repository.
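Before making that change, it helps to see why the results differ: with a sliding window, one event belongs to several overlapping windows and is therefore counted more than once. The plain-Python sketch below illustrates the assignment rule; `sliding_window_starts` is a hypothetical helper, and timestamps are integer milliseconds. In PyFlink the corresponding assigner should be `SlidingProcessingTimeWindows.of(size, slide)`, assuming your version provides it.

```python
def sliding_window_starts(timestamp, size, slide):
    """Return the start times of every sliding window containing `timestamp`.

    Each window covers [start, start + size) and a new window begins
    every `slide` time units.
    """
    # Start of the latest window that contains the timestamp
    start = timestamp - (timestamp % slide)
    starts = []
    while start > timestamp - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


# A 10-second window sliding every 5 seconds: the event at 7000 ms
# falls into the windows starting at 0 and at 5000.
print(sliding_window_starts(7000, size=10_000, slide=5_000))
```

When `size` equals `slide` the function returns a single start time, which is exactly the tumbling-window case used in the demo.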
Appendix: Flink Command Quick Reference
| Task | Command |
|---|---|
| Submit a job | flink run -py <script.py> |
| Cancel a job | flink cancel <jobID> |
| List running and scheduled jobs | flink list |
| Show a program's execution plan | flink info <jar-file> |