Deploying Hadoop and Spark on Native Windows

Environment: Windows 11, hadoop-3.3.6, spark-3.5.4-bin-without-hadoop


  1. After downloading and extracting hadoop-3.3.6, create a data folder in the installation directory, and inside it create datanode and namenode subfolders.

  2. Edit the configuration files
  • D:\softwares\hadoop-3.3.6\etc\hadoop\hadoop-env.cmd
    The key parts are set JAVA_HOME=%JAVA_HOME% on line 24 and, at the end of the file, the search path the JVM uses when loading native libraries (.dll files on Windows). hadoop-env.cmd applies to all Hadoop subcommands (hdfs, mapred, yarn).
  • On Windows, Hadoop needs hadoop.dll and winutils.exe to support local file system operations. If JAVA_LIBRARY_PATH is not set, Spark will throw errors (a way to verify the native library is shown right after the file listing below).
@rem Licensed to the Apache Software Foundation (ASF) under one or more
@rem contributor license agreements.  See the NOTICE file distributed with
@rem this work for additional information regarding copyright ownership.
@rem The ASF licenses this file to You under the Apache License, Version 2.0
@rem (the "License"); you may not use this file except in compliance with
@rem the License.  You may obtain a copy of the License at
@rem
@rem     http://www.apache.org/licenses/LICENSE-2.0
@rem
@rem Unless required by applicable law or agreed to in writing, software
@rem distributed under the License is distributed on an "AS IS" BASIS,
@rem WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@rem See the License for the specific language governing permissions and
@rem limitations under the License.

@rem Set Hadoop-specific environment variables here.

@rem The only required environment variable is JAVA_HOME.  All others are
@rem optional.  When running a distributed configuration it is best to
@rem set JAVA_HOME in this file, so that it is correctly defined on
@rem remote nodes.

@rem The java implementation to use.  Required.
set JAVA_HOME=%JAVA_HOME%

@rem The jsvc implementation to use. Jsvc is required to run secure datanodes.
@rem set JSVC_HOME=%JSVC_HOME%

@rem set HADOOP_CONF_DIR=

@rem Extra Java CLASSPATH elements.  Automatically insert capacity-scheduler.
if exist %HADOOP_HOME%\contrib\capacity-scheduler (
  if not defined HADOOP_CLASSPATH (
    set HADOOP_CLASSPATH=%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
  ) else (
    set HADOOP_CLASSPATH=%HADOOP_CLASSPATH%;%HADOOP_HOME%\contrib\capacity-scheduler\*.jar
  )
)

@rem The maximum amount of heap to use, in MB. Default is 1000.
@rem set HADOOP_HEAPSIZE=
@rem set HADOOP_NAMENODE_INIT_HEAPSIZE=""

@rem Extra Java runtime options.  Empty by default.
@rem set HADOOP_OPTS=%HADOOP_OPTS% -Djava.net.preferIPv4Stack=true

@rem Command specific options appended to HADOOP_OPTS when specified
if not defined HADOOP_SECURITY_LOGGER (
  set HADOOP_SECURITY_LOGGER=INFO,RFAS
)
if not defined HDFS_AUDIT_LOGGER (
  set HDFS_AUDIT_LOGGER=INFO,NullAppender
)

set HADOOP_NAMENODE_OPTS=-Dhadoop.security.logger=%HADOOP_SECURITY_LOGGER% -Dhdfs.audit.logger=%HDFS_AUDIT_LOGGER% %HADOOP_NAMENODE_OPTS%
set HADOOP_DATANODE_OPTS=-Dhadoop.security.logger=ERROR,RFAS %HADOOP_DATANODE_OPTS%
set HADOOP_SECONDARYNAMENODE_OPTS=-Dhadoop.security.logger=%HADOOP_SECURITY_LOGGER% -Dhdfs.audit.logger=%HDFS_AUDIT_LOGGER% %HADOOP_SECONDARYNAMENODE_OPTS%

@rem The following applies to multiple commands (fs, dfs, fsck, distcp etc)
set HADOOP_CLIENT_OPTS=-Xmx512m %HADOOP_CLIENT_OPTS%
@rem set HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData %HADOOP_JAVA_PLATFORM_OPTS%"

@rem On secure datanodes, user to run the datanode as after dropping privileges
set HADOOP_SECURE_DN_USER=%HADOOP_SECURE_DN_USER%

@rem Where log files are stored.  %HADOOP_HOME%/logs by default.
@rem set HADOOP_LOG_DIR=%HADOOP_LOG_DIR%\%USERNAME%

@rem Where log files are stored in the secure data environment.
set HADOOP_SECURE_DN_LOG_DIR=%HADOOP_LOG_DIR%\%HADOOP_HDFS_USER%

@rem
@rem Router-based HDFS Federation specific parameters
@rem Specify the JVM options to be used when starting the RBF Routers.
@rem These options will be appended to the options specified as HADOOP_OPTS
@rem and therefore may override any similar flags set in HADOOP_OPTS
@rem
@rem set HADOOP_DFSROUTER_OPTS=""
@rem

@rem The directory where pid files are stored. /tmp by default.
@rem NOTE: this should be set to a directory that can only be written to by 
@rem       the user that will run the hadoop daemons.  Otherwise there is the
@rem       potential for a symlink attack.
set HADOOP_PID_DIR=%HADOOP_PID_DIR%
set HADOOP_SECURE_DN_PID_DIR=%HADOOP_PID_DIR%

@rem A string representing this instance of hadoop. %USERNAME% by default.
set HADOOP_IDENT_STRING=%USERNAME%

@rem Set the Hadoop native library path
set JAVA_LIBRARY_PATH=D:\softwares\hadoop-3.3.6\bin
@rem Add Hadoop's bin directory to PATH
set PATH=%PATH%;D:\softwares\hadoop-3.3.6\bin
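
You can check whether hadoop.dll is actually picked up from this path with the standard checknative subcommand (run it in a fresh terminal after the environment variables take effect); it reports whether the native hadoop library was loaded:

Check native libraries: hadoop checknative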

- D:\softwares\hadoop-3.3.6\etc\hadoop\core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>D:/softwares/hadoop-3.3.6/data/tmp</value>
    <!-- custom temporary directory -->
  </property>

</configuration>

Explanation:
In Hadoop's configuration, fs.defaultFS identifies the default file system: it tells Hadoop clients which file system to connect to for read and write operations when no scheme is given.
Here the value is hdfs://localhost:9000, meaning the default file system is an HDFS instance running on the local host and reachable on port 9000. The hdfs:// URI scheme marks it as a distributed file system rather than the local file system or another storage service.
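
Because of this default, paths without a scheme resolve against hdfs://localhost:9000. As a minimal check (once HDFS has been formatted and started as described further below), the following two commands list the same directory:

hadoop fs -ls /
hadoop fs -ls hdfs://localhost:9000/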

- D:\softwares\hadoop-3.3.6\etc\hadoop\hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/D:/softwares/hadoop-3.3.6/data/namenode</value>
  </property>
  
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/D:/softwares/hadoop-3.3.6/data/snn</value>
  </property>
  
  <property>
    <name>fs.checkpoint.edits.dir</name>
    <value>/D:/softwares/hadoop-3.3.6/data/snn</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/D:/softwares/hadoop-3.3.6/data/datanode</value>
  </property>
  
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
    <!-- 64 MB; overrides the default block size of 128 MB -->
  </property>
</configuration>
  • Explanation (a quick way to verify the effective values follows this list):

    • dfs.replication: the number of replicas kept for each data block in HDFS. A value of 1 means every block has exactly one copy.

    • dfs.permissions: controls whether HDFS permission checking is enabled. Setting it to false disables permission checks, so any user can read and write the file system regardless of file ownership and permissions.

    • dfs.namenode.name.dir: the local directory where the NameNode stores the file system metadata.

    • fs.checkpoint.dir: where the Secondary NameNode stores its checkpoint images.

    • fs.checkpoint.edits.dir: like fs.checkpoint.dir, but used specifically for checkpointed edit logs.

    • dfs.datanode.data.dir: the list of local disk directories where the DataNode stores the actual data blocks.
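
To print the value Hadoop actually loads for a given key (rather than the built-in default), hdfs getconf can be used, for example:

hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.namenode.name.dir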

  • D:\softwares\hadoop-3.3.6\etc\hadoop\yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>


  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>


</configuration>
  • Explanation: these two properties enable and configure the shuffle mechanism used when MapReduce jobs run on YARN (see the quick check after this list).

    • The first enables the NodeManager auxiliary service mapreduce_shuffle. Once enabled, each NodeManager starts a ShuffleHandler that serves the intermediate output of map tasks to the reduce tasks that fetch it.

    • The second specifies the class that implements the shuffle service. The default is org.apache.hadoop.mapred.ShuffleHandler, which answers HTTP requests from reducers and returns the corresponding map output.
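
Once the services are started (step 5 below), you can confirm that the NodeManager has registered with the ResourceManager; a missing or misspelled shuffle service usually only surfaces when the first MapReduce job runs (see the wordcount check after the mapred-site.xml section):

List registered nodes: yarn node -list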

- D:\softwares\hadoop-3.3.6\etc\hadoop\mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
  • Explanation:
    • Its core purpose is to tell MapReduce jobs to use YARN as the execution framework. If this parameter is missing or set incorrectly (e.g. to local or classic), MapReduce programs will not run on the cluster; they will run in local mode or fail to run at all. An end-to-end check is sketched below.
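
As an end-to-end check that MapReduce jobs really execute on YARN (and that the shuffle service configured above works), the bundled wordcount example can be run once HDFS and YARN are up (step 5). The input file name here is only a placeholder, and the examples jar path follows the usual layout of the 3.3.6 distribution — verify the exact filename in your installation:

hadoop fs -mkdir -p /input
hadoop fs -put some-local-file.txt /input/
yarn jar %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-3.3.6.jar wordcount /input /output
hadoop fs -cat /output/part-r-00000
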
  3. Download the matching version of winutils and copy it into the bin folder.

  4. Configure the environment variables

    Not covered in detail here.

  5. Start the services

Format HDFS: hdfs namenode -format
Start HDFS: start-dfs.cmd
Start YARN: start-yarn.cmd
Check processes: jps
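
If everything started, jps should list NameNode, DataNode, ResourceManager and NodeManager processes (plus Jps itself). The web UIs are reachable on the Hadoop 3.x default ports:

NameNode UI: http://localhost:9870
ResourceManager UI: http://localhost:8088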

  6. Initialize HDFS directories
hadoop fs -mkdir /user
hadoop fs -mkdir /user/zss
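
A quick sanity check of the new directory (the local file name is just an example):

hadoop fs -put test.txt /user/zss/
hadoop fs -ls /user/zss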

Spark Configuration

  • Environment variable setup is not covered in detail here.
  1. Copy spark-env.sh.template to spark-env.cmd.
  2. Edit D:\softwares\spark-3.5.4-bin-without-hadoop\conf\spark-env.cmd:
@echo off
setlocal enabledelayedexpansion

rem Hadoop integration paths
set HADOOP_HOME=D:\softwares\hadoop-3.3.6
set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop

@REM rem Set Spark's Hadoop classpath (the key step!)
@REM set SPARK_DIST_CLASSPATH= 
@REM for /r "%HADOOP_HOME%\share\hadoop" %%i in (*.jar) do set SPARK_DIST_CLASSPATH=!SPARK_DIST_CLASSPATH!;%%i
@REM rem Add the Hadoop native library (hadoop.dll)
@REM set SPARK_DIST_CLASSPATH=%SPARK_DIST_CLASSPATH%;%HADOOP_HOME%\bin

set SPARK_DIST_CLASSPATH=
set SPARK_DIST_CLASSPATH=!SPARK_DIST_CLASSPATH!;%HADOOP_HOME%\share\hadoop\common\*
set SPARK_DIST_CLASSPATH=!SPARK_DIST_CLASSPATH!;%HADOOP_HOME%\share\hadoop\hdfs\*
set SPARK_DIST_CLASSPATH=!SPARK_DIST_CLASSPATH!;%HADOOP_HOME%\share\hadoop\mapreduce\*
set SPARK_DIST_CLASSPATH=!SPARK_DIST_CLASSPATH!;%HADOOP_HOME%\share\hadoop\yarn\*
set SPARK_DIST_CLASSPATH=!SPARK_DIST_CLASSPATH!;%HADOOP_CONF_DIR%

rem Basic Spark settings
set SPARK_LOCAL_IP=127.0.0.1
set SPARK_MASTER_HOST=localhost
set SPARK_WORKER_MEMORY=4g

After Hadoop is running, start Spark with: spark-shell --master local[*]
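
To confirm that Spark picks up the Hadoop jars assembled in SPARK_DIST_CLASSPATH, the bundled SparkPi example can be submitted; reading a file from HDFS inside spark-shell (e.g. spark.read.textFile("/user/zss/test.txt").count()) is a further check of the HDFS integration. The examples jar name below follows the usual layout of the 3.5.4 (Scala 2.12) distribution, and SPARK_HOME is assumed to point at the Spark installation — verify the exact filename in %SPARK_HOME%\examples\jars:

spark-submit --master local[*] --class org.apache.spark.examples.SparkPi %SPARK_HOME%\examples\jars\spark-examples_2.12-3.5.4.jar 10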
