(Set 1) ZZ052 Big Data Application and Services Competition Tasks

Module 1: Platform Setup and Operations

(I) Task 1: Big Data Platform Setup

All configuration in this module is performed as the root user; all components are installed under the /root/software directory.

Prerequisites

# Run on all three machines
[root@localhost ~]# systemctl stop firewalld
[root@localhost ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.

[root@localhost ~]# vi /etc/hosts
# Append:
192.168.1.122 master
192.168.1.123 slave1
192.168.1.124 slave2

[root@localhost ~]# ssh-keygen # press Enter through every prompt
[root@localhost ~]# ssh-copy-id master # answer yes, then enter the password
[root@localhost ~]# ssh-copy-id slave1
[root@localhost ~]# ssh-copy-id slave2

# Set the hostname
[root@localhost software]# hostnamectl set-hostname master # likewise: slave1 on the second node, slave2 on the third
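
With passwordless SSH in place, the /etc/hosts edit does not need to be repeated by hand on each node; a minimal sketch that pushes master's copy to the other two machines:

[root@master ~]# for h in slave1 slave2; do scp /etc/hosts root@$h:/etc/hosts; done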

1. Subtask 1: Basic Environment Preparation

The JDK must be installed on all three nodes: master, slave1, and slave2.

(1) Extract the JDK installation package into the /root/software directory;

[root@localhost software]# tar -zxvf /opt/software/jdk-8u391-linux-x64.tar.gz -C /root/software/
[root@localhost software]# cd /root/software/
[root@localhost software]# mv jdk1.8.0_391/ jdk

(2) Configure the JDK environment variables JAVA_HOME and PATH in the /etc/profile file, and make the configuration take effect immediately;

[root@localhost software]# vi /etc/profile
# Append
export JAVA_HOME=/root/software/jdk
export PATH=$PATH:$JAVA_HOME/bin
[root@localhost software]# source /etc/profile


(3) Check the JDK version to verify that the JDK was installed successfully.

[root@localhost software]# java -version
java version "1.8.0_391"
Java(TM) SE Runtime Environment (build 1.8.0_391-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.391-b13, mixed mode)

# Remember to distribute the JDK to slave1 and slave2
[root@master software]# scp -r jdk/ slave1:`pwd`
[root@master software]# scp -r jdk/ slave2:`pwd`

Perform the following on the master node.

(1) Generate an SSH key pair on master;

Already done in the Prerequisites section.

(2) Copy master's public key to slave1 and slave2; verify by connecting from master to slave1 and slave2 over SSH.

[root@localhost software]# ssh slave1
Last login: Mon Oct 28 11:43:50 2024 from 192.168.1.166
[root@slave1 ~]# exit;
logout
Connection to slave1 closed.
[root@localhost software]# ssh slave1
Last login: Mon Oct 28 11:58:39 2024 from master
[root@slave1 ~]# exit
logout
Connection to slave1 closed.
[root@localhost software]# ssh slave2
Last login: Mon Oct 28 11:43:52 2024 from 192.168.1.166
[root@slave2 ~]# exit
logout
Connection to slave2 closed.

2. Subtask 2: Hadoop Fully Distributed Installation and Configuration

Hadoop must be installed on all three nodes: master, slave1, and slave2.

(1) On the master node, extract the Hadoop installation package into the /root/software directory;

[root@master software]# tar -zxvf /opt/software/hadoop-3.3.6.tar.gz -C /root/software/
[root@master software]# mv /root/software/hadoop-3.3.6/ /root/software/hadoop # later steps assume this path

(2) Configure the hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and workers configuration files in turn; the Hadoop cluster deployment plan is given in the table below;

Table 1: Hadoop cluster deployment plan
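
A minimal sketch of the key properties, assuming master hosts the NameNode, SecondaryNameNode, and ResourceManager (consistent with the jps output in step (7)) and all three nodes run a DataNode and NodeManager:

# hadoop-env.sh
export JAVA_HOME=/root/software/jdk
# Required when operating the cluster as root; Hadoop 3.x refuses to start otherwise
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

# core-site.xml
<property><name>fs.defaultFS</name><value>hdfs://master:9000</value></property>
<property><name>hadoop.tmp.dir</name><value>/root/software/hadoop/hadoopDatas/tempDatas</value></property>

# hdfs-site.xml
<property><name>dfs.namenode.name.dir</name><value>file:///root/software/hadoop/hadoopDatas/namenodeDatas</value></property>
<property><name>dfs.datanode.data.dir</name><value>file:///root/software/hadoop/hadoopDatas/datanodeDatas</value></property>
<property><name>dfs.replication</name><value>3</value></property>

# mapred-site.xml
<property><name>mapreduce.framework.name</name><value>yarn</value></property>

# yarn-site.xml
<property><name>yarn.resourcemanager.hostname</name><value>master</value></property>
<property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>

# workers
master
slave1
slave2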

(3) On the master node, under the Hadoop installation directory, create the directories hadoopDatas/tempDatas, hadoopDatas/namenodeDatas, hadoopDatas/datanodeDatas, hadoopDatas/dfs/nn/edits, hadoopDatas/dfs/snn/name, and hadoopDatas/dfs/nn/snn/edits in turn;

[root@localhost hadoop]# mkdir -p hadoopDatas/tempDatas
[root@localhost hadoop]# mkdir hadoopDatas/namenodeDatas
[root@localhost hadoop]# mkdir hadoopDatas/datanodeDatas
[root@localhost hadoop]# mkdir -p hadoopDatas/dfs/nn/edits
[root@localhost hadoop]# mkdir -p hadoopDatas/dfs/snn/name
[root@localhost hadoop]# mkdir -p hadoopDatas/dfs/nn/snn/edits

(4) On the master node, use the scp command to copy the configured Hadoop installation directory to slave1 and slave2;

# Run from /root/software so that `pwd` expands to the correct destination
scp -r /root/software/hadoop/ slave1:`pwd`
scp -r /root/software/hadoop/ slave2:`pwd`

(5) Configure the Hadoop environment variables HADOOP_HOME and PATH in the /etc/profile file on all three nodes, and make the configuration take effect immediately;

[root@master software]# vi /etc/profile
# Append
export HADOOP_HOME=/root/software/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
[root@master software]# source /etc/profile
[root@master software]# scp /etc/profile slave1:/etc
profile 100% 1985 3.4MB/s 00:00
[root@master software]# scp /etc/profile slave2:/etc
profile 100% 1985 3.4MB/s 00:00

# slave1
[root@slave1 ~]# source /etc/profile
# slave2
[root@slave2 ~]# source /etc/profile

(6) Format the cluster on the master node;

[root@master software]# hadoop namenode -format # deprecated alias of: hdfs namenode -format

(7) On the master node, start the HDFS cluster, the YARN cluster, and the history server in turn.

[root@master /]# start-all.sh
Starting namenodes on [master]
Last login: Mon Oct 28 16:05:13 CST 2024 on pts/0
Starting datanodes
Last login: Mon Oct 28 16:05:25 CST 2024 on pts/0
Starting secondary namenodes [master]
Last login: Mon Oct 28 16:05:27 CST 2024 on pts/0
Starting resourcemanager
Last login: Mon Oct 28 16:05:33 CST 2024 on pts/0
Starting nodemanagers
Last login: Mon Oct 28 16:05:40 CST 2024 on pts/0
Last login: Mon Oct 28 16:05:42 CST 2024 on pts/0
[root@master /]# mapred --daemon start historyserver
[root@master /]# jps
70417 WebAppProxyServer
68435 NameNode
71267 JobHistoryServer
69218 SecondaryNameNode
71385 Jps
69560 ResourceManager
69816 NodeManager
68718 DataNode

3. Subtask 3: MySQL Installation and Configuration

Operate on the master node only.

(1) Extract the MySQL 5.7.25 installation package into the /root/software directory;

[root@master software]# tar -xvf /opt/software/mysql-5.7.44-1.el7.x86_64.rpm-bundle.tar -C /root/software/
mysql-community-client-5.7.44-1.el7.x86_64.rpm
mysql-community-common-5.7.44-1.el7.x86_64.rpm
mysql-community-devel-5.7.44-1.el7.x86_64.rpm
mysql-community-embedded-5.7.44-1.el7.x86_64.rpm
mysql-community-embedded-compat-5.7.44-1.el7.x86_64.rpm
mysql-community-embedded-devel-5.7.44-1.el7.x86_64.rpm
mysql-community-libs-5.7.44-1.el7.x86_64.rpm
mysql-community-libs-compat-5.7.44-1.el7.x86_64.rpm
mysql-community-server-5.7.44-1.el7.x86_64.rpm
mysql-community-test-5.7.44-1.el7.x86_64.rpm

(2) Use rpm -ivh to install, in order, the mysql-community-common, mysql-community-libs, mysql-community-libs-compat, mysql-community-client, and mysql-community-server packages;

[root@master software]# rpm -ivh mysql-community-common-5.7.44-1.el7.x86_64.rpm
[root@master software]# rpm -ivh mysql-community-libs-5.7.44-1.el7.x86_64.rpm
[root@master software]# rpm -ivh mysql-community-libs-compat-5.7.44-1.el7.x86_64.rpm
[root@master software]# rpm -ivh mysql-community-client-5.7.44-1.el7.x86_64.rpm
[root@master software]# rpm -ivh mysql-community-server-5.7.44-1.el7.x86_64.rpm

(3) After MySQL is installed, initialize and start the database as the mysql user;

[root@master software]# mysqld --initialize-insecure --user=mysql --datadir=/var/lib/mysql
[root@master software]# systemctl start mysqld

(4) Log in to MySQL as the root user with no password, then change the root user's password to 123456; after the change, exit MySQL and log in again to verify that the new password works; change the host field of the user table in the mysql database from localhost to % to enable remote login; once done, flush the privilege information so the changes take effect.

[root@master software]# mysql -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.44 MySQL Community Server (GPL)

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> alter user 'root'@'localhost' identified by '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> exit;
Bye
[root@master software]# mysql -uroot -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 5.7.44 MySQL Community Server (GPL)

Copyright (c) 2000, 2023, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select user,host from mysql.user;
+---------------+-----------+
| user | host |
+---------------+-----------+
| mysql.session | localhost |
| mysql.sys | localhost |
| root | localhost |
+---------------+-----------+
3 rows in set (0.00 sec)

mysql> update mysql.user set host='%' where user='root';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> update mysql.user set host='%' where user='mysql.sys';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> update mysql.user set host='%' where user='mysql.session';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)

4. Subtask 4: Hive Installation and Configuration

Operate on the master node only.

(1) Extract the Hive 3.1.2 installation package into the /root/software directory;

[root@master software]# tar -zxvf /opt/software/apache-hive-3.1.3-bin.tar.gz -C /root/software/

(2) Configure the Hive environment variables HIVE_HOME and PATH in the /etc/profile file, and make the configuration take effect immediately;

[root@master software]# cd /root/software/
[root@master software]# mv apache-hive-3.1.3-bin/ hive
[root@master software]# vi /etc/profile
# Append
export HIVE_HOME=/root/software/hive
export PATH=$PATH:$HIVE_HOME/bin
[root@master software]# source /etc/profile

(3) Check the Hive version to verify that the Hive environment variables were set successfully;

[root@master software]# hive --version
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/software/hive/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/software/hadoop/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive 3.1.3
Git git://MacBook-Pro.fios-router.home/Users/ngangam/commit/hive -r 4df4d75bf1e16fe0af75aad0b4179c34c07fc975
Compiled by ngangam on Sun Apr 3 16:58:16 EDT 2022
From source with checksum 5da234766db5dfbe3e92926c9bbab2af

(4) Switch to the $HIVE_HOME/conf directory, copy the hive-env.sh.template file and rename the copy hive-env.sh; then edit it with vim, configuring the values of the HADOOP_HOME, HIVE_CONF_DIR, and HIVE_AUX_JARS_PATH parameters: delete the existing values and remove the leading comment marker #; when done, save and exit;

[root@master conf]# cp hive-env.sh.template hive-env.sh
# Configure these values
HADOOP_HOME=/root/software/hadoop
export HIVE_CONF_DIR=/root/software/hive/conf
export HIVE_AUX_JARS_PATH=/root/software/hive/lib

(5) Copy the MySQL driver package mysql-connector-java-5.1.47-bin.jar from the /root/software directory to the $HIVE_HOME/lib directory;

[root@master conf]# cp /opt/software/mysql-connector-java-5.1.34.jar /root/software/hive/lib/

(6) In the $HIVE_HOME/conf directory, create a file named hive-site.xml and edit it with vim;

[root@master conf]# vi hive-site.xml
# then add the configuration below

Configure the following content:
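
A minimal sketch of a MySQL-backed metastore configuration, assuming a metastore database named hive_db and the root/123456 credentials set up in Subtask 3:

<configuration>
  <!-- JDBC connection to the MySQL metastore; hive_db is an assumed name -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/hive_db?createDatabaseIfNotExist=true&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <!-- default HDFS location for managed tables -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>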

(7) Use the schematool command, specifying the metastore database type as "mysql", to initialize the metastore database;

[root@master conf]# schematool -initSchema -dbType mysql
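
If the initialization succeeds, the metastore tables become visible in MySQL (hive_db is the database name assumed in the hive-site.xml sketch above):

[root@master conf]# mysql -uroot -p123456 -e "USE hive_db; SHOW TABLES;" | head
# expect metastore tables such as DBS, TBLS, PARTITIONS and VERSION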

(8) Start Hive with the CLI and enter the Hive client; create a managed table named student in Hive's default database;

[root@master conf]# hive
hive> create table student(
> id int,
> name string);
OK
Time taken: 1.73 seconds

(9) Insert one test record into the student table with an insert statement.

hive> insert into table student values(1, 'hkj');

5. Subtask 5: Flume Installation and Configuration

Operate on the master node only.

(1) Extract the Flume 1.11.0 installation package into the /root/software directory;

[root@master software]# tar -zxvf /opt/software/apache-flume-1.11.0-bin.tar.gz -C /root/software/

(2) Configure the Flume environment variables FLUME_HOME and PATH in the /etc/profile file, and make the configuration take effect immediately;

[root@master software]# cd /root/software/
[root@master software]# mv apache-flume-1.11.0-bin/ flume
[root@master software]# vi /etc/profile
# Append
export FLUME_HOME=/root/software/flume
export PATH=$PATH:$FLUME_HOME/bin
[root@master software]# source /etc/profile

(3) Use the cd command to enter the /root/software/apache-flume-1.11.0-bin/conf directory and use the cp command to copy the flume-env.sh.template file, renaming the copy flume-env.sh; open the flume-env.sh configuration file with vim, locate the JAVA_HOME parameter, remove the leading "#", and change its value to the actual location of the local JDK; when done, save and exit;

[root@master conf]# cd /root/software/flume/conf/
[root@master conf]# cp flume-env.sh.template flume-env.sh
# Remove the leading # from the JAVA_HOME line and update the path
# e.g.: export JAVA_HOME=/root/software/jdk

(4) Check the Flume version to verify that Flume was installed successfully.

[root@master conf]# flume-ng version
Flume 1.11.0
Source code repository: https://git.apache.org/repos/asf/flume.git
Revision: 1a15927e594fd0d05a59d804b90a9c31ec93f5e1
Compiled by rgoers on Sun Oct 16 14:44:15 MST 2022
From source with checksum bbbca682177262aac3a89defde369a37
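
Beyond the version check, a quick functional test can be run with the stock netcat-to-logger example agent; a sketch (the agent name a1, file name example.conf, and port 44444 are illustrative, not part of the task):

# $FLUME_HOME/conf/example.conf -- single netcat source, memory channel, logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

# Run in the foreground; lines typed into `nc localhost 44444` appear as logged events
[root@master conf]# flume-ng agent -n a1 -c $FLUME_HOME/conf -f $FLUME_HOME/conf/example.conf -Dflume.root.logger=INFO,console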

(II) Task 2: Database Configuration and Maintenance

1. Subtask 1: Database Configuration

In Hive, create a database named comm; if the database already exists, do not create it.

[root@master conf]# hive
hive> create database if not exist comm;
FAILED: ParseException line 1:23 missing KW_EXISTS at 'exist' near '<EOF>'
line 1:29 extraneous input 'comm' expecting EOF near '<EOF>'
hive> create database if not exists comm;
OK
Time taken: 0.297 seconds

2. Subtask 2: Creating the Required Tables

(1) In the comm database, create an external table named ods_behavior_log; if the table already exists, drop it first; the partition column is dt, i.e., the table is partitioned by date; additionally, use the location keyword to set the table's storage path to the HDFS directory /behavior/ods/ods_behavior_log; the column types are shown in the table below;

hive> use comm;
OK
Time taken: 0.042 seconds
hive> CREATE EXTERNAL TABLE IF NOT EXISTS ods_behavior_log (
> line STRING
> )
> PARTITIONED BY (dt STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> LOCATION '/behavior/ods/ods_behavior_log';
OK
Time taken: 0.244 seconds

(2) Use the load data clause to load each data file in the local /root/eduhq/data/app_log/behavior directory, one by one, into the corresponding partition of the external table ods_behavior_log, defining a static partition from each log file's date (e.g., dt='2023-01-01')

hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/excel_log/behavior2023-01-01.xlsx' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-01');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-01)
OK
Time taken: 1.831 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/excel_log/behavior2023-01-02.xlsx' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-02');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-02)
OK
Time taken: 0.628 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/excel_log/behavior2023-01-03.xlsx' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-03');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-03)
OK
Time taken: 0.491 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/excel_log/behavior2023-01-04.xlsx' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-04');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-04)
OK
Time taken: 0.506 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/excel_log/behavior2023-01-05.xlsx' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-05');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-05)
OK
Time taken: 0.465 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/excel_log/behavior2023-01-06.xlsx' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-06');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-06)
OK
Time taken: 0.44 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/excel_log/behavior2023-01-07.xlsx' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-07');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-07)
OK
Time taken: 0.423 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/txt_log/behavior2023-01-01.log' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-01');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-01)
OK
Time taken: 0.432 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/txt_log/behavior2023-01-02.log' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-02');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-02)
OK
Time taken: 0.397 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/txt_log/behavior2023-01-03.log' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-03');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-03)
OK
Time taken: 0.425 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/txt_log/behavior2023-01-04.log' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-04');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-04)
OK
Time taken: 0.43 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/txt_log/behavior2023-01-05.log' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-05');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-05)
OK
Time taken: 0.395 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/txt_log/behavior2023-01-06.log' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-06');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-06)
OK
Time taken: 0.422 seconds
hive> LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/txt_log/behavior2023-01-07.log' INTO TABLE ods_behavior_log PARTITION (dt='2023-01-07');
Loading data to table comm.ods_behavior_log partition (dt=2023-01-07)
OK
Time taken: 0.773 seconds
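
Since the fourteen statements differ only in the date and the file extension, they can equally be generated with a small shell loop; a sketch using hive -e over the same paths as above:

# Convenience loop equivalent to the fourteen LOAD DATA statements above
for d in 01 02 03 04 05 06 07; do
  hive -e "LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/excel_log/behavior2023-01-${d}.xlsx' INTO TABLE comm.ods_behavior_log PARTITION (dt='2023-01-${d}');
           LOAD DATA LOCAL INPATH '/root/eduhq/data/app_log/behavior/txt_log/behavior2023-01-${d}.log' INTO TABLE comm.ods_behavior_log PARTITION (dt='2023-01-${d}');"
done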

(3) View all existing partitions of the ods_behavior_log table and its first 3 rows, and count the total number of rows in the external table ods_behavior_log;

# View all existing partitions
hive> SHOW PARTITIONS ods_behavior_log;
OK
dt=2023-01-01
dt=2023-01-02
dt=2023-01-03
dt=2023-01-04
dt=2023-01-05
dt=2023-01-06
dt=2023-01-07
Time taken: 0.089 seconds, Fetched: 7 row(s)

# First three rows
hive> SELECT * FROM ods_behavior_log LIMIT 3;
OK
{"client_ip": "222.86.237.202", "device_type": "pc", "time": 1672580908000, "type": "WIFI", "device": "357c5dd65e834b8bafac861bdae5d76d", "url": "http://wan.baidu.com/home?idfrom=4087"} 2023-01-01
{"client_ip": "210.42.113.5", "device_type": "pc", "time": 1672539082000, "type": "WIFI", "device": "6cab446d0c664de4b09c76bae90fc1bf", "url": "http://go.hao123.com/sites"} 2023-01-01
{"client_ip": "222.34.42.67", "device_type": "mobile", "time": 1672524854000, "type": "5G", "device": "36b4213510e74c7e8455b61fb481b576", "url": "https://4366yy.381pk.com/1251/"} 2023-01-01
Time taken: 1.805 seconds, Fetched: 3 row(s)

# Count the total number of rows
hive> SELECT COUNT(*) FROM ods_behavior_log;
Query ID = root_20241028190258_2de0fddd-343e-4256-b248-f252f3400a9b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1730102747884_0003, Tracking URL = http://master:8089/proxy/application_1730102747884_0003/
Kill Command = /root/software/hadoop/bin/mapred job -kill job_1730102747884_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2024-10-28 19:03:31,071 Stage-1 map = 0%, reduce = 0%
2024-10-28 19:03:42,782 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2024-10-28 19:04:43,115 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2024-10-28 19:05:43,251 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2024-10-28 19:06:44,246 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.55 sec
2024-10-28 19:06:53,529 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.32 sec
MapReduce Total cumulative CPU time: 4 seconds 320 msec
Ended Job = job_1730102747884_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.32 sec HDFS Read: 37824491 HDFS Write: 106 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 320 msec
OK
198413
Time taken: 235.981 seconds, Fetched: 1 row(s)

(4) In the comm database, create an external table named dwd_behavior_log; if the table already exists, drop it first; the partition column is dt, i.e., the table is partitioned by date; additionally, set the table's storage path to the HDFS directory /behavior/dwd/dwd_behavior_log, the storage file type to "orc", and the file compression type to "snappy"; the column types are shown in the table below;

hive> CREATE EXTERNAL TABLE IF NOT EXISTS comm.dwd_behavior_log (
> client_ip STRING,
> device_type STRING,
> type STRING,
> device STRING,
> url STRING,
> province STRING,
> city STRING,
> ts BIGINT
> )
> PARTITIONED BY (dt STRING)
> STORED AS ORC
> LOCATION '/behavior/dwd/dwd_behavior_log'
> TBLPROPERTIES ("orc.compress"="SNAPPY");
OK
Time taken: 0.091 seconds
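
The ods rows are raw JSON strings (see the SELECT output in step (3)), so populating dwd_behavior_log typically comes down to parsing them; a minimal sketch using get_json_object and dynamic partitioning. The province and city columns would come from an IP-geolocation lookup, which is beyond this step, so they are left NULL:

hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE comm.dwd_behavior_log PARTITION (dt)
> SELECT get_json_object(line, '$.client_ip'),
>        get_json_object(line, '$.device_type'),
>        get_json_object(line, '$.type'),
>        get_json_object(line, '$.device'),
>        get_json_object(line, '$.url'),
>        CAST(NULL AS STRING),  -- province: would need an IP-geolocation UDF
>        CAST(NULL AS STRING),  -- city
>        CAST(get_json_object(line, '$.time') AS BIGINT),
>        dt
> FROM comm.ods_behavior_log;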