Big Data Learning Path: HDFS

Source code: github.com/hiszm/hadoo…


HDFS Overview (Hadoop Distributed File System)

  • Distributed
  • Commodity, low-cost hardware: no reliance on expensive, centralized IOE-style equipment
  • Fault-tolerant: highly fault tolerant, with 3 replicas per block by default
  • High throughput: moving computation is cheaper than moving data
  • Large data sets: typically at the GB and TB scale

HDFS Architecture in Detail

  • NameNode (master) / DataNodes (slaves): HDFS follows a master/slave architecture, consisting of a single NameNode (NN) and multiple DataNodes (DN):

    • NameNode: performs operations on the file system namespace, which behaves much like most other file systems (e.g. Linux): creating, deleting, modifying, and listing files and directories. It also stores the cluster metadata, recording which blocks make up each file and where they live.
    • DataNode: serves read and write requests from file system clients and performs block creation, deletion, and related operations.
  • HDFS stores each file as a sequence of blocks; each block is replicated for fault tolerance, and the replicas are stored on DataNodes. The block size and replication factor are configurable per file (by default the block size is 128 MB and the replication factor is 3).

  • HDFS is written in Java and runs on GNU/Linux.

Datanodes

Official documentation link

For example: a.txt is 150 MB and the block size is 128 MB, so the file is split into two blocks: block1 is 128 MB and block2 is 22 MB.

The question then becomes: which DataNodes should block1 and block2 be placed on? This is transparent to the user; HDFS takes care of it.
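
Once the cluster is running (set up later in this post), one way to verify how a file was split and which DataNodes hold each block is hdfs fsck. A minimal sketch, assuming a.txt has already been uploaded to /a.txt:

hdfs fsck /a.txt -files -blocks -locations
# the output lists every block of the file, its length, and the DataNodes storing its replicas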

(Figure: HdfsDesign, the HDFS architecture diagram)

  • File system namespace

    • A user or an application can create directories and store files inside these directories. (supports the usual CRUD operations)
    • The file system namespace hierarchy is similar to most other existing file systems. (much like Linux)
    • One can create and remove files, move a file from one directory to another, or rename a file. (supports the usual CRUD operations)
    • HDFS supports user quotas and access permissions. HDFS does not support hard links or soft links. (no hard or soft links)
    • The NameNode maintains the file system namespace. (a single NN, multiple DNs)
  • Architectural stability

    • Heartbeats and re-replication: each DataNode periodically sends a heartbeat to the NameNode; if no heartbeat arrives within the configured timeout, the DataNode is marked dead. The NameNode stops routing new I/O requests to dead DataNodes and no longer uses the data stored on them. Because that data is no longer available, some blocks may fall below their target replication factor; the NameNode tracks these blocks and re-replicates them when necessary.

    • Data integrity: blocks stored on a DataNode can become corrupted, for example due to storage hardware failure. To avoid reading corrupted data, HDFS verifies integrity with checksums: when a client creates an HDFS file, it computes a checksum for each block and stores the checksums in a separate hidden file in the same HDFS namespace. When a client reads the file back, it verifies that the data received from each DataNode matches the stored checksum; if it does not, the data is corrupt and the client fetches another available replica of that block from a different DataNode.

    • Metadata disk failure: the FsImage and the EditLog are the core metadata of HDFS, and losing them unexpectedly can make the whole HDFS service unusable. To guard against this, the NameNode can be configured to keep multiple synchronized copies of the FsImage and EditLog, so that any change is applied to every copy.

    • Snapshots: a snapshot stores a copy of the data at a particular point in time; if data is corrupted unexpectedly, it can be rolled back to a known-good state (a command sketch follows this list).
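
As a quick illustration of snapshots, the sketch below enables snapshots on a directory and creates one; the directory name /hdfs-test and the snapshot name s0 are placeholders for illustration, not taken from the original setup.

hdfs dfsadmin -allowSnapshot /hdfs-test
hdfs dfs -createSnapshot /hdfs-test s0
# the snapshot is now readable under /hdfs-test/.snapshot/s0;
# a damaged file can be restored by copying it back out:
hadoop fs -cp /hdfs-test/.snapshot/s0/README.txt /hdfs-test/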

HDFS Replication

  • It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. (by default the block size is 128 MB and the replication factor is 3)

  • An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. (both the replication factor and the block size can be changed)

To minimize bandwidth consumption and read latency, HDFS serves a read request from the replica closest to the reader. If a replica exists on the same rack as the reader, that replica is preferred; if the HDFS cluster spans multiple data centers, a replica in the local data center is preferred.
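
A minimal sketch of setting both per file from the command line (the file name big.log is a placeholder, not from the original post):

# set the block size for this one upload (256 MB) via a generic -D option
hadoop fs -D dfs.blocksize=268435456 -put big.log /big.log

# change the replication factor of the existing file to 2 and wait until it takes effect
hadoop fs -setrep -w 2 /big.log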

Linux Environment Overview

(base) JackSundeMBP:~ jacksun$ ssh hadoop@192.168.68.200

[hadoop@hadoop000 ~]$ pwd
/home/hadoop


[hadoop@hadoop000 ~]$ ls
app   Desktop    Downloads  maven_resp  Pictures  README.txt  software   t.txt
data  Documents  lib        Music       Public    shell       Templates  Videos

Directory      Purpose
software       installation packages
app            installed software
data           data
lib            jar packages
shell          scripts
maven_resp     maven dependencies

[hadoop@hadoop000 ~]$ sudo vi /etc/hosts

192.168.68.200 hadoop000
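
To confirm the hostname mapping works, a quick check (not part of the original post):

ping -c 1 hadoop000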

Hadoop Deployment

JDK 1.8 Deployment

  • Copy the archive to the server: scp jdk_name hadoop@192.168.1.200:~/
  • Unpack the JDK: tar -zvxf jdk_name -C ~/app
  • Configure the environment: vi .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91
export PATH=$JAVA_HOME/bin:$PATH

Apply the changes: source .bash_profile

java -version

java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

If the above is printed, the JDK is installed correctly.

Passwordless SSH Setup

  • Generate a key pair: ssh-keygen -t rsa

  • cd .ssh

  • Append the public key to the authorized keys: cat id_rsa.pub >> authorized_keys

-rw------- 1 hadoop hadoop  796 8月  16 06:17 authorized_keys
-rw------- 1 hadoop hadoop 1675 8月  16 06:14 id_rsa
-rw-r--r-- 1 hadoop hadoop  398 8月  16 06:14 id_rsa.pub
-rw-r--r-- 1 hadoop hadoop 1230 8月  16 18:05 known_hosts

id_rsa is the private key; id_rsa.pub is the public key.

[hadoop@hadoop000 ~]$ ssh localhost 
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is SHA256:LZvkeJHnqH0AtihqFB2AcQJKwMpH1/DorPi0bIEKcQM.
ECDSA key fingerprint is MD5:9f:b5:f3:bd:f2:aa:61:97:8b:8a:e2:a3:98:5a:e4:3d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Last login: Sun Aug 16 18:03:23 2020 from 192.168.1.3
[hadoop@hadoop000 ~]$ ls
app              Desktop    lib         Pictures    shell      t.txt
authorized_keys  Documents  maven_resp  Public      software   Videos
data             Downloads  Music       README.txt  Templates
[hadoop@hadoop000 ~]$ ssh localhost 
Last login: Sun Aug 16 18:05:21 2020 from 127.0.0.1
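
If ssh localhost still prompted for a password at this point, the usual culprit is the permissions on ~/.ssh; a typical fix (not part of the original post):

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys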

Hadoop Installation Directory Layout and hadoop-env Configuration

Configure JAVA_HOME

[hadoop@hadoop000 hadoop]$ ls
capacity-scheduler.xml      httpfs-env.sh            mapred-env.sh
configuration.xsl           httpfs-log4j.properties  mapred-queues.xml.template
container-executor.cfg      httpfs-signature.secret  mapred-site.xml
core-site.xml               httpfs-site.xml          mapred-site.xml.template
hadoop-env.cmd              kms-acls.xml             slaves
hadoop-env.sh               kms-env.sh               ssl-client.xml.example
hadoop-metrics2.properties  kms-log4j.properties     ssl-server.xml.example
hadoop-metrics.properties   kms-site.xml             yarn-env.cmd
hadoop-policy.xml           log4j.properties         yarn-env.sh
hdfs-site.xml               mapred-env.cmd           yarn-site.xml
[hadoop@hadoop000 hadoop]$ pwd
/home/hadoop/app/hadoop-2.6.0-cdh5.15.1/etc/hadoop
[hadoop@hadoop000 hadoop]$ sudo vi hadoop-env.sh 

-----------------------------

# The java implementation to use.
#export JAVA_HOME=${JAVA_HOME}

export JAVA_HOME=/home/hadoop/app/jdk1.8.0_91 

vi ~/.bash_profile

export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0-cdh5.15.1
export PATH=$HADOOP_HOME/bin:$PATH

cd $HADOOP_HOME/bin

  • Directory layout
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ ls
bin             etc                  include  LICENSE.txt  README.txt  src
bin-mapreduce1  examples             lib      logs         sbin
cloudera        examples-mapreduce1  libexec  NOTICE.txt   share
Directory      Purpose
bin            Hadoop client commands
etc/hadoop     Hadoop configuration files
sbin           scripts that start the Hadoop-related processes
share          common examples

HDFS Formatting and Startup

archive.cloudera.com/cdh5/cdh/5/…

vi etc/hadoop/core-site.xml:

This tells HDFS that the NameNode runs on port 8020 of this machine:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop000:8020</value>
    </property>
</configuration>

vi etc/hadoop/hdfs-site.xml:

<configuration>


    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/app/tmp</value>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>


</configuration>
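
A quick way to confirm these properties were picked up is hdfs getconf (not in the original post):

hdfs getconf -confKey fs.defaultFS      # should print hdfs://hadoop000:8020
hdfs getconf -confKey dfs.replication   # should print 1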

vi slaves
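
The slaves file lists the DataNode hostnames, one per line. For this single-node setup it would presumably contain only the hostname configured in /etc/hosts above (an assumption, since the original post does not show its contents):

hadoop000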

Format the file system the first time only (do not run this again later): hdfs namenode -format

The related commands live in $HADOOP_HOME/bin:

cd $HADOOP_HOME/bin

  • Start the cluster: $HADOOP_HOME/sbin/start-dfs.sh

Verify that it worked:

[hadoop@hadoop000 sbin]$ jps
13607 NameNode
14073 Jps
13722 DataNode
13915 SecondaryNameNode
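
Besides jps, another sanity check (not in the original post) is to ask the NameNode for a cluster report:

hdfs dfsadmin -report    # lists the live DataNodes and their capacity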
  • Firewall interference

If jps shows the processes but http://192.168.1.200:50070 will not open in the browser, it is most likely the firewall.

Check the firewall: firewall-cmd --state. Stop the firewall: systemctl stop firewalld.service

[hadoop@hadoop000 sbin]$ firewall-cmd --state
not running
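
To keep the firewall from coming back after a reboot in this test environment, one could also disable the service (not in the original post):

systemctl disable firewalld.service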

  • Stop the cluster: $HADOOP_HOME/sbin/stop-dfs.sh

  • Note

start-dfs.sh is equivalent to:

hadoop-daemons.sh start namenode
hadoop-daemons.sh start datanode
hadoop-daemons.sh start secondarynamenode

Likewise for stop-dfs.sh.
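
That is, presumably (a sketch; the original post does not spell it out):

hadoop-daemons.sh stop namenode
hadoop-daemons.sh stop datanode
hadoop-daemons.sh stop secondarynamenode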

Hadoop Command-Line Operations

If you changed environment variables, remember to source ~/.bash_profile.

[hadoop@hadoop000 bin]$ ./hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
  credential           interact with credential providers
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  s3guard              manage data on S3
  trace                view and modify Hadoop tracing settings
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.


[hadoop@hadoop000 bin]$ ./hadoop fs
Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] [-v] [-x] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] [-x] <path> ...]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
	[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touchz <path> ...]
	[-usage [cmd ...]]

  • Common commands:

    • hadoop fs -ls /
    • hadoop fs -cat / (same as hadoop fs -text /)
    • hadoop fs -put / (same as hadoop fs -copyFromLocal /)
    • hadoop fs -get /README.txt ./
    • hadoop fs -mkdir /hdfs-test
    • hadoop fs -mv
    • hadoop fs -rm
    • hadoop fs -rmdir
    • hadoop fs -rmr (same as hadoop fs -rm -r)
    • hadoop fs -getmerge

[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -put README.txt  /
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -ls /
Found 1 items
-rw-r--r--   1 hadoop supergroup       1366 2020-08-17 21:35 /README.txt

[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -cat /README.txt

......
and our wiki, at:
......
  Hadoop Core uses the SSL libraries from the Jetty project written 
by mortbay.org.

[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -get /README.txt ./


[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -mkdir /hdfs-test
[hadoop@hadoop000 hadoop-2.6.0-cdh5.15.1]$ hadoop fs -ls /
Found 2 items
-rw-r--r--   1 hadoop supergroup       1366 2020-08-17 21:35 /README.txt
drwxr-xr-x   - hadoop supergroup          0 2020-08-17 21:48 /hdfs-test

A Closer Look at HDFS Storage

We saw above that a file gets split into two blocks, but where are those blocks actually stored?

From this we can conclude: a put splits one file into n blocks and stores them on different nodes; a get goes to those nodes, finds the n blocks, and reads the corresponding data back.
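
To see where the blocks physically live, one can look under the DataNode data directory (a sketch, not from the original post; by default the block files sit under ${hadoop.tmp.dir}/dfs/data, and hadoop.tmp.dir was set to /home/hadoop/app/tmp above):

# list the raw block files stored by this DataNode
find /home/hadoop/app/tmp/dfs/data -name 'blk_*'
# hdfs fsck (shown earlier) maps each of these blocks back to the files they belong to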
