Reference article: https://www.pianshen.com/article/931082274/
What I did: followed the article step by step and wrote out the complete code myself.
JSON parsing UDF development (assignment)
Assignment:
Given the following raw JSON data:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
The data needs to be loaded into the Hive data warehouse.
Requirement: no matter how many intermediate tables are used, the final result must be a table of this shape:

movie	rate	timestamp	uid
1197	3	978302268	1
The teacher's suggested approach:
Note: everything is done inside Hive; a custom function may be used.
Step 1: write a custom UDF that parses each JSON record into four fields and joins them into a single \t-delimited line.
Step 2: register the custom function.
Step 3: create a temporary table and load the JSON-formatted data into a single column of that table.
Step 4: use insert overwrite local directory to query the temporary table through the custom function and write the result to a local path.
Step 5: use load data to load that result into the new table.
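Steps 3 through 5 can be sketched in HiveQL roughly as follows. This is only a sketch: it assumes the UDF from steps 1–2 is already registered as `parseJson` and emits \t-delimited output, and the output directory `/export/datas/hivedata/parsed` is a placeholder path.

```sql
-- Step 3: a temporary table holding each raw JSON line in one string column
CREATE TABLE t_json (line string);
LOAD DATA LOCAL INPATH '/export/datas/hivedata/data.json' INTO TABLE t_json;

-- Step 4: run the registered UDF over every line and write the delimited
-- result to a local directory (directory path is hypothetical)
INSERT OVERWRITE LOCAL DIRECTORY '/export/datas/hivedata/parsed'
SELECT parseJson(line) FROM t_json;

-- Step 5: a final table whose delimiter matches the UDF output,
-- then load the exported files into it
CREATE TABLE t_rate (movie string, rate string, `timestamp` string, uid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/export/datas/hivedata/parsed' INTO TABLE t_rate;
```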
My solution:
I. The data (be very careful not to get this step wrong — one JSON record per line):
{"movie":"2717","rate":"3","timeStamp":"978298196","uid":"2"}
{"movie":"2571","rate":"4","timeStamp":"978299773","uid":"2"}
{"movie":"1917","rate":"3","timeStamp":"978300174","uid":"2"}
{"movie":"2396","rate":"4","timeStamp":"978299641","uid":"2"}
{"movie":"3735","rate":"3","timeStamp":"978298814","uid":"2"}
{"movie":"1953","rate":"4","timeStamp":"978298775","uid":"2"}
{"movie":"1597","rate":"3","timeStamp":"978300025","uid":"2"}
{"movie":"3809","rate":"3","timeStamp":"978299712","uid":"2"}
{"movie":"1954","rate":"5","timeStamp":"978298841","uid":"2"}
{"movie":"1955","rate":"4","timeStamp":"978299200","uid":"2"}
{"movie":"235","rate":"3","timeStamp":"978299351","uid":"2"}
{"movie":"1124","rate":"5","timeStamp":"978299418","uid":"2"}
{"movie":"1957","rate":"5","timeStamp":"978298750","uid":"2"}
{"movie":"163","rate":"4","timeStamp":"978299809","uid":"2"}
#Create the t_json table
0: jdbc:hive2://localhost:10000> create table t_json(line string) row format delimited;
#Load the data
0: jdbc:hive2://localhost:10000> load data local inpath '/export/datas/hivedata/data.json' into table t_json;
Result after loading:
0: jdbc:hive2://node03.hadoop.com:10000> select * from t_json;
+----------------------------------------------------+--+
| t_json.line |
+----------------------------------------------------+--+
| {"movie":"2717","rate":"3","timeStamp":"978298196","uid":"2"} |
| {"movie":"2571","rate":"4","timeStamp":"978299773","uid":"2"} |
| {"movie":"1917","rate":"3","timeStamp":"978300174","uid":"2"} |
| {"movie":"2396","rate":"4","timeStamp":"978299641","uid":"2"} |
| {"movie":"3735","rate":"3","timeStamp":"978298814","uid":"2"} |
| {"movie":"1953","rate":"4","timeStamp":"978298775","uid":"2"} |
| {"movie":"1597","rate":"3","timeStamp":"978300025","uid":"2"} |
| {"movie":"3809","rate":"3","timeStamp":"978299712","uid":"2"} |
| {"movie":"1954","rate":"5","timeStamp":"978298841","uid":"2"} |
| {"movie":"1955","rate":"4","timeStamp":"978299200","uid":"2"} |
| {"movie":"235","rate":"3","timeStamp":"978299351","uid":"2"} |
| {"movie":"1124","rate":"5","timeStamp":"978299418","uid":"2"} |
| {"movie":"1957","rate":"5","timeStamp":"978298750","uid":"2"} |
| {"movie":"163","rate":"4","timeStamp":"978299809","uid":"2"} |
+----------------------------------------------------+--+
14 rows selected (0.047 seconds)
Verify that the data meets expectations:
0: jdbc:hive2://node03.hadoop.com:10000> select get_json_object(line,'$.movie') as movie from t_json limit 10;
+--------+--+
| movie |
+--------+--+
| 2717 |
| 2571 |
| 1917 |
| 2396 |
| 3735 |
| 1953 |
| 1597 |
| 3809 |
| 1954 |
| 1955 |
+--------+--+
10 rows selected (0.05 seconds)
Data initialization is now complete.
II. Custom UDF source code
1. Create a Java project. Nothing special here; once the project is created, add the pom dependencies below, or copy Hive's dependency jars into the project directly.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>bigdata19</artifactId>
        <groupId>cn.itcast.bigdata19</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>day06_hiveudf</artifactId>

    <!-- repositories: which repository to download jars from -->
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.json/json -->
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20190722</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.0</version>
            <!--<version>2.6.0-cdh5.14.0</version>-->
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
            <!--<version>1.1.0-cdh5.14.0</version>-->
            <!-- scope has four values:
                 compile:  needed at compile time and when packaging (the default)
                 test:     needed only for tests, not packaged
                 provided: needed during development, not packaged -->
            <!-- <scope>provided</scope> -->
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- shade plugin: packages the dependencies we use into the jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.2</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
2. Create the Java class JsonParser:
package cn.itcast.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import parquet.org.codehaus.jackson.map.ObjectMapper;

import java.io.IOException;

public class JsonParser extends UDF {

    ObjectMapper objectMapper = new ObjectMapper();

    public String evaluate(String jsonLine) {
        if (null == jsonLine) {
            return null;
        }
        if (jsonLine.trim().isEmpty()) {
            return "";
        }
        MovieRateBean readValue;
        try {
            readValue = objectMapper.readValue(jsonLine, MovieRateBean.class);
            return readValue.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }

    public static void main(String[] args) {
        // {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
        String jsonStr = "{\"movie\":\"1287\",\"rate\":\"5\",\"timeStamp\":\"978302039\",\"uid\":\"1\"}";
        JsonParser jsonParser = new JsonParser();
        String result = jsonParser.evaluate(jsonStr);
        System.out.println(result);
    }
}
3. Create the bean class:
package cn.itcast.hive.udf;

public class MovieRateBean {

    private String movie;
    private String rate;
    private String timeStamp;
    private String uid;

    @Override
    public String toString() {
        return movie + ',' + rate + ',' + timeStamp + ',' + uid;
    }

    public String getMovie() { return movie; }
    public void setMovie(String movie) { this.movie = movie; }
    public String getRate() { return rate; }
    public void setRate(String rate) { this.rate = rate; }
    public String getTimeStamp() { return timeStamp; }
    public void setTimeStamp(String timeStamp) { this.timeStamp = timeStamp; }
    public String getUid() { return uid; }
    public void setUid(String uid) { this.uid = uid; }
}
4. Package the jar and upload it to the server where Hive runs.
[root@node03 lib]# rz -E
rz waiting to receive.
[root@node03 lib]# mv day06_hiveudf-1.0-SNAPSHOT.jar JsonParser.jar
[root@node03 lib]# pwd
/export/servers/hive-1.1.0-cdh5.14.0/lib
[root@node03 lib]#
5. Run this command to add the jar to the classpath:
0: jdbc:hive2://node03.hadoop.com:10000> add jar /export/servers/hive-1.1.0-cdh5.14.0/lib/JsonParser.jar;
6. Create the function:
0: jdbc:hive2://node03.hadoop.com:10000> create temporary function parseJson as 'cn.itcast.hive.udf.JsonParser';
7. Test it:
0: jdbc:hive2://node03.hadoop.com:10000> select parseJson(line) from t_json;
+---------------------+--+
| _c0 |
+---------------------+--+
| 2717,3,978298196,2 |
| 2571,4,978299773,2 |
| 1917,3,978300174,2 |
| 2396,4,978299641,2 |
| 3735,3,978298814,2 |
| 1953,4,978298775,2 |
| 1597,3,978300025,2 |
| 3809,3,978299712,2 |
| 1954,5,978298841,2 |
| 1955,4,978299200,2 |
| 235,3,978299351,2 |
| 1124,5,978299418,2 |
| 1957,5,978298750,2 |
| 163,4,978299809,2 |
+---------------------+--+
14 rows selected (0.358 seconds)
At this point the custom function has been created successfully.
III. Splitting and processing the data in Hive
Split on the commas:
0: jdbc:hive2://node03.hadoop.com:10000> select split(parseJson(line),',')[0] movie, split(parseJson(line),',')[1] rate, split(parseJson(line),',')[2] `timestamp`, split(parseJson(line),',')[3] uid from t_json;
+--------+-------+------------+------+--+
| movie  | rate  | timestamp  | uid  |
+--------+-------+------------+------+--+
| 2717 | 3 | 978298196 | 2 |
| 2571 | 4 | 978299773 | 2 |
| 1917 | 3 | 978300174 | 2 |
| 2396 | 4 | 978299641 | 2 |
| 3735 | 3 | 978298814 | 2 |
| 1953 | 4 | 978298775 | 2 |
| 1597 | 3 | 978300025 | 2 |
| 3809 | 3 | 978299712 | 2 |
| 1954 | 5 | 978298841 | 2 |
| 1955 | 4 | 978299200 | 2 |
| 235 | 3 | 978299351 | 2 |
| 1124 | 5 | 978299418 | 2 |
| 1957 | 5 | 978298750 | 2 |
| 163 | 4 | 978299809 | 2 |
+--------+-------+------------+-----+--+
14 rows selected (0.064 seconds)
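Note that the query above is written to invoke `parseJson` four times per row. A hypothetical alternative is to parse each line once in a subquery and split the cached result in the outer query:

```sql
SELECT split(parsed, ',')[0] AS movie,
       split(parsed, ',')[1] AS rate,
       split(parsed, ',')[2] AS `timestamp`,
       split(parsed, ',')[3] AS uid
FROM (SELECT parseJson(line) AS parsed FROM t_json) t;
```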
Save the query result into a Hive table:
0: jdbc:hive2://node03.hadoop.com:10000> create table t_rate as select split(parseJson(line),',')[0] movie, split(parseJson(line),',')[1] rate, split(parseJson(line),',')[2] `timestamp`, split(parseJson(line),',')[3] uid from t_json;
No rows affected (13.179 seconds)
0: jdbc:hive2://node03.hadoop.com:10000> select * from t_rate;
+---------------+--------------+-------------------+-------------+--+
| t_rate.movie  | t_rate.rate  | t_rate.timestamp  | t_rate.uid  |
+---------------+--------------+-------------------+-------------+--+
| 2717 | 3 | 978298196 | 2 |
| 2571 | 4 | 978299773 | 2 |
| 1917 | 3 | 978300174 | 2 |
| 2396 | 4 | 978299641 | 2 |
| 3735 | 3 | 978298814 | 2 |
| 1953 | 4 | 978298775 | 2 |
| 1597 | 3 | 978300025 | 2 |
| 3809 | 3 | 978299712 | 2 |
| 1954 | 5 | 978298841 | 2 |
| 1955 | 4 | 978299200 | 2 |
| 235 | 3 | 978299351 | 2 |
| 1124 | 5 | 978299418 | 2 |
| 1957 | 5 | 978298750 | 2 |
| 163 | 4 | 978299809 | 2 |
+---------------+--------------+-------------------+------------+--+
14 rows selected (0.094 seconds)
The above solves the problem with a custom Hive function. Hive actually has a built-in function for this as well; here is a demonstration:
0: jdbc:hive2://node03.hadoop.com:10000> select get_json_object(line,'$.movie') as movie from t_json limit 10;
+--------+--+
| movie |
+--------+--+
| 2717 |
| 2571 |
| 1917 |
| 2396 |
| 3735 |
| 1953 |
| 1597 |
| 3809 |
| 1954 |
| 1955 |
+--------+--+
10 rows selected (0.149 seconds)
The built-in function only handles relatively simple JSON structures; for complex JSON formats you still need to write a custom function.
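For extracting several keys at once there is also the built-in `json_tuple` UDTF, which pulls multiple fields out of each line in a single pass. A sketch against the same `t_json` table (the output column names here are arbitrary aliases):

```sql
SELECT b.movie, b.rate, b.ts, b.uid
FROM t_json a
LATERAL VIEW json_tuple(a.line, 'movie', 'rate', 'timeStamp', 'uid') b
  AS movie, rate, ts, uid;
```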