Learning Hive (3): Ways to Import Data into Hive

Ways to Import Data into Hive

  • Import data into a Hive table from the local file system
  • Import data into a Hive table from HDFS

Both of these approaches use Hive's LOAD statement; the syntax is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
  • If the LOCAL keyword is used, Hive looks for filepath on the local file system. A relative filepath is interpreted relative to the user's current working directory. You can also give a full URI for the local file, e.g. file:///data/track.log, or simply write /data/track.log. The LOAD statement copies all files addressed by filepath to the target file system (inferred from the table's location attribute) and then moves them into the table.

  • If the LOCAL keyword is not used, filepath must refer to files on the same file system as the target table's location (e.g., HDFS). In this case LOAD is essentially just a move of data files from one HDFS directory to another, as the sketch after this list shows.
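For quick reference, here is a minimal sketch of both variants; tablename, the paths, and the partition value are placeholders, and the optional OVERWRITE keyword replaces any existing data in the target partition:

# Copy a local file into the warehouse (placeholders throughout)
hive> load data local inpath '/data/track.log' into table tablename partition (dt='2016-10-18');
# Move an HDFS file into the warehouse, replacing existing partition data via OVERWRITE
hive> load data inpath '/data/track.log' overwrite into table tablename partition (dt='2016-10-18');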

There are of course other ways to import data into Hive, but here we focus on these two; for the other methods, see: https://www.iteblog.com/archives/949
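As a quick taste of one such alternative, data can also be imported with a query over an existing table; the table names below are placeholders, not from the original post:

# Import via INSERT ... SELECT instead of LOAD (table names are placeholders)
hive> insert into table target_table partition (dt='2016-10-18')
    > select logs, level, message, client_timestamp from source_table;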

Below we walk through concrete examples of both import methods. Here is the content of the log file track.log that we will analyze:

{"logs":[{"timestamp":"1475912701768","rpid":"63146996042563584","name":"birdben.ad.click_ad","bid":0,"uid":0,"did":0,"duid":0,"hb_uid":0,"ua":"","device_id":"","server_timestamp":1475912715001}],"level":"info","message":"logs","timestamp":"2016-10-08T07:45:15.001Z"}

Importing data into a Hive table from the local file system

# View the content of track.log, the file to be imported into Hive
$ cat /data/track.log
{"logs":[{"timestamp":"1475912701768","rpid":"63146996042563584","name":"birdben.ad.click_ad","bid":0,"uid":0,"did":0,"duid":0,"hb_uid":0,"ua":"","device_id":"","server_timestamp":1475912715001}],"level":"info","message":"logs","timestamp":"2016-10-08T07:45:15.001Z"}
# Create the table test_local_table
hive> CREATE TABLE IF NOT EXISTS test_local_table(logs array<struct<name:string, rpid:string, bid:string, uid:string, did:string, duid:string, hbuid:string, ua:string, device_id:string, ip:string, server_timestamp:BIGINT>>, level STRING, message STRING, client_timestamp BIGINT)
partitioned by (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
# Import data into the Hive table from the local file system
hive> load data local inpath '/data/track.log' into table test_local_table partition (dt='2016-10-18');
# After the import completes, query the data in test_local_table
# (client_timestamp is NULL because the JSON key is "timestamp", which does not match the column name)
hive> select * from test_local_table;
OK
[{"name":"birdben.ad.click_ad","rpid":"63146996042563584","bid":"0","uid":"0","did":"0","duid":"0","hbuid":null,"ua":"","device_id":"","ip":null,"server_timestamp":1475912715001}]	info	logs	NULL	2016-10-18
Time taken: 0.106 seconds, Fetched: 1 row(s)
# View track.log under the /hive/warehouse directory in HDFS; this is where the file
# is stored after being imported from the local file system into Hive
# test_hdfs.db is our database
# test_local_table is the table we created
# dt=2016-10-18 is the partition we created
$ hdfs dfs -ls /hive/warehouse/test_hdfs.db/test_local_table/dt=2016-10-18
Found 1 items
-rwxr-xr-x   2 yunyu supergroup        268 2016-10-17 21:19 /hive/warehouse/test_hdfs.db/test_local_table/dt=2016-10-18/track.log
# View the file content
$ hdfs dfs -cat /hive/warehouse/test_hdfs.db/test_local_table/dt=2016-10-18/track.log
{"logs":[{"timestamp":"1475912701768","rpid":"63146996042563584","name":"birdben.ad.click_ad","bid":0,"uid":0,"did":0,"duid":0,"hb_uid":0,"ua":"","device_id":"","server_timestamp":1475912715001}],"level":"info","message":"logs","timestamp":"2016-10-08T07:45:15.001Z"}
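Because logs is an array of structs, individual events can be pulled out with LATERAL VIEW explode; a minimal sketch, where the alias log_entry is our own:

# Flatten the logs array so that each event becomes one row (aliases are our own)
hive> select log_entry.name, log_entry.rpid, log_entry.server_timestamp
    > from test_local_table lateral view explode(logs) t as log_entry
    > where dt='2016-10-18';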

Importing data into a Hive table from HDFS

# View the file in HDFS that will be imported into Hive (here we use the track.log
# data that Flume previously collected into HDFS)
$ hdfs dfs -ls /flume/events/birdben.ad.click_ad/201610/
Found 1 items
-rw-r--r--   2 yunyu supergroup       6776 2016-10-13 06:18 /flume/events/birdben.ad.click_ad/201610/events-.1476364421957
# Create the table test_partition_table
hive> CREATE TABLE IF NOT EXISTS test_partition_table(logs array<struct<name:string, rpid:string, bid:string, uid:string, did:string, duid:string, hbuid:string, ua:string, device_id:string, ip:string, server_timestamp:BIGINT>>, level STRING, message STRING, client_timestamp BIGINT)
partitioned by (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
# Import data into the Hive table from HDFS
hive> load data inpath '/flume/events/birdben.ad.click_ad/201610/events-.1476364421957' into table test_partition_table partition (dt='2016-10-18');
# After the import completes, query the data in test_partition_table
hive> select * from test_partition_table;
OK
[{"name":"birdben.ad.click_ad","rpid":"59948935480868864","bid":null,"uid":"0","did":"0","duid":"0","hbuid":null,"ua":"","device_id":"","ip":null,"server_timestamp":1475150396804}]	info	logs	NULL	2016-10-18
[{"name":"birdben.ad.click_ad","rpid":"59948935480868864","bid":null,"uid":"0","did":"0","duid":"0","hbuid":null,"ua":"","device_id":"","ip":null,"server_timestamp":1475150470244}]	info	logs	NULL	2016-10-18
...
Time taken: 0.102 seconds, Fetched: 26 row(s)
# List the source directory in HDFS again: the source file is no longer there, because it
# has been moved under /hive/warehouse. In other words, loading data into Hive from HDFS
# moves the original HDFS file into Hive's default data warehouse directory
# (/hive/warehouse, as configured in hive-site.xml)
$ hdfs dfs -ls /flume/events/birdben.ad.click_ad/201610
# Look in the HDFS directory of Hive's default data warehouse and you will find the imported file
$ hdfs dfs -ls /hive/warehouse/test_hdfs.db/test_partition_table/dt=2016-10-18
Found 1 items
-rwxr-xr-x   2 yunyu supergroup       6776 2016-10-13 06:18 /hive/warehouse/test_hdfs.db/test_partition_table/dt=2016-10-18/events-.1476364421957
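If the data should stay where it is in HDFS rather than be moved into the warehouse, the usual alternative is an external table pointed at that directory; a minimal sketch with the same schema, where the table name test_external_table is our own and LOCATION reuses the Flume directory above:

# An external table reads the files in place instead of moving them (table name is our own)
hive> CREATE EXTERNAL TABLE IF NOT EXISTS test_external_table(logs array<struct<name:string, rpid:string, bid:string, uid:string, did:string, duid:string, hbuid:string, ua:string, device_id:string, ip:string, server_timestamp:BIGINT>>, level STRING, message STRING, client_timestamp BIGINT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/flume/events/birdben.ad.click_ad/201610';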
