Reference article: https://www.pianshen.com/article/931082274/
What I did: followed the article step by step and wrote out the complete code myself.
JSON parsing UDF development (assignment)
Assignment:
Given the following raw JSON data:
{"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}
{"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}
{"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}
{"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}
{"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}
{"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}
{"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
The data needs to be loaded into the Hive data warehouse.
Requirement: no matter how many intermediate tables are used, the final result must be a table of this shape:

movie	rate	timestamp	uid
1197	3	978302268	1
The teacher's suggested approach:
Note: everything is done inside Hive; a custom function may be used.
Step 1: write a custom UDF that parses each JSON record into four fields and joins them into a single \t-delimited line.
Step 2: register the custom function.
Step 3: create a temporary table and load the JSON-formatted data into a single column of that table.
Step 4: use insert overwrite local directory to query the temporary table through the custom function and write the result to a local path.
Step 5: use load data to load that result into the new table.
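Steps 3 through 5 can be sketched in HiveQL roughly as follows. This is only a sketch: it assumes the UDF from steps 1–2 is already registered as `parseJson` and emits \t-delimited output, and the output directory `/export/datas/hivedata/parsed` is a placeholder path.

```sql
-- Step 3: a temporary table holding each raw JSON line in one string column
CREATE TABLE t_json (line string);
LOAD DATA LOCAL INPATH '/export/datas/hivedata/data.json' INTO TABLE t_json;

-- Step 4: run the registered UDF over every line and write the delimited
-- result to a local directory (directory path is hypothetical)
INSERT OVERWRITE LOCAL DIRECTORY '/export/datas/hivedata/parsed'
SELECT parseJson(line) FROM t_json;

-- Step 5: a final table whose delimiter matches the UDF output,
-- then load the exported files into it
CREATE TABLE t_rate (movie string, rate string, `timestamp` string, uid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/export/datas/hivedata/parsed' INTO TABLE t_rate;
```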
My solution:
I. The data (be very careful not to get this step wrong — one JSON record per line):
{"movie":"2717","rate":"3","timeStamp":"978298196","uid":"2"}
{"movie":"2571","rate":"4","timeStamp":"978299773","uid":"2"}
{"movie":"1917","rate":"3","timeStamp":"978300174","uid":"2"}
{"movie":"2396","rate":"4","timeStamp":"978299641","uid":"2"}
{"movie":"3735","rate":"3","timeStamp":"978298814","uid":"2"}
{"movie":"1953","rate":"4","timeStamp":"978298775","uid":"2"}
{"movie":"1597","rate":"3","timeStamp":"978300025","uid":"2"}
{"movie":"3809","rate":"3","timeStamp":"978299712","uid":"2"}
{"movie":"1954","rate":"5","timeStamp":"978298841","uid":"2"}
{"movie":"1955","rate":"4","timeStamp":"978299200","uid":"2"}
{"movie":"235","rate":"3","timeStamp":"978299351","uid":"2"}
{"movie":"1124","rate":"5","timeStamp":"978299418","uid":"2"}
{"movie":"1957","rate":"5","timeStamp":"978298750","uid":"2"}
{"movie":"163","rate":"4","timeStamp":"978299809","uid":"2"}
#Create the t_json table
0: jdbc:hive2://localhost:10000> create table t_json(line string) row format delimited;
#Load the data
0: jdbc:hive2://localhost:10000> load data local inpath '/export/datas/hivedata/data.json' into table t_json;
Result after loading:
0: jdbc:hive2://node03.hadoop.com:10000> select * from t_json;
+----------------------------------------------------+--+
| t_json.line |
+----------------------------------------------------+--+
| {"movie":"2717","rate":"3","timeStamp":"978298196","uid":"2"} |
| {"movie":"2571","rate":"4","timeStamp":"978299773","uid":"2"} |
| {"movie":"1917","rate":"3","timeStamp":"978300174","uid":"2"} |
| {"movie":"2396","rate":"4","timeStamp":"978299641","uid":"2"} |
| {"movie":"3735","rate":"3","timeStamp":"978298814","uid":"2"} |
| {"movie":"1953","rate":"4","timeStamp":"978298775","uid":"2"} |
| {"movie":"1597","rate":"3","timeStamp":"978300025","uid":"2"} |
| {"movie":"3809","rate":"3","timeStamp":"978299712","uid":"2"} |
| {"movie":"1954","rate":"5","timeStamp":"978298841","uid":"2"} |
| {"movie":"1955","rate":"4","timeStamp":"978299200","uid":"2"} |
| {"movie":"235","rate":"3","timeStamp":"978299351","uid":"2"} |
| {"movie":"1124","rate":"5","timeStamp":"978299418","uid":"2"} |
| {"movie":"1957","rate":"5","timeStamp":"978298750","uid":"2"} |
| {"movie":"163","rate":"4","timeStamp":"978299809","uid":"2"} |
+----------------------------------------------------+--+
14 rows selected (0.047 seconds)
Verify that the data meets expectations:
0: jdbc:hive2://node03.hadoop.com:10000> select get_json_object(line,'$.movie') as movie from t_json limit 10;
+--------+--+
| movie |
+--------+--+
| 2717 |
| 2571 |
| 1917 |
| 2396 |
| 3735 |
| 1953 |
| 1597 |
| 3809 |
| 1954 |
| 1955 |
+--------+--+
10 rows selected (0.05 seconds)
Data initialization is now complete.
II. Custom UDF source code
1. Create a Java project. Nothing special here; once the project is created, add the pom dependencies below, or copy Hive's dependency jars into the project directly.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>bigdata19</artifactId>
        <groupId>cn.itcast.bigdata19</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>day06_hiveudf</artifactId>

    <!-- repositories: which repository to download jars from -->
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.json/json -->
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20190722</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.0</version>
            <!--<version>2.6.0-cdh5.14.0</version>-->
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
            <!--<version>1.1.0-cdh5.14.0</version>-->
            <!-- scope has four values:
                 compile:  needed at compile time and when packaging (the default)
                 test:     needed only for tests, not packaged
                 provided: needed during development, not packaged -->
            <!-- <scope>provided</scope> -->
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <!-- shade plugin: packages the dependencies we use into the jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.2</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
2. Create the Java class JsonParser:
package cn.itcast.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import parquet.org.codehaus.jackson.map.ObjectMapper;

import java.io.IOException;

public class JsonParser extends UDF {

    ObjectMapper objectMapper = new ObjectMapper();

    public String evaluate(String jsonLine) {
        if (null == jsonLine) {
            return null;
        }
        if (jsonLine.trim().isEmpty()) {
            return "";
        }
        MovieRateBean readValue;
        try {
            readValue = objectMapper.readValue(jsonLine, MovieRateBean.class);
            return readValue.toString();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "";
    }

    public static void main(String[] args) {
        // {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
        String jsonStr = "{\"movie\":\"1287\",\"rate\":\"5\",\"timeStamp\":\"978302039\",\"uid\":\"1\"}";
        JsonParser jsonParser = new JsonParser();
        String result = jsonParser.evaluate(jsonStr);
        System.out.println(result);
    }
}
3. Create the bean class:
package cn.itcast.hive.udf;

public class MovieRateBean {

    private String movie;
    private String rate;
    private String timeStamp;
    private String uid;

    @Override
    public String toString() {
        return movie + ',' + rate + ',' + timeStamp + ',' + uid;
    }

    public String getMovie() { return movie; }
    public void setMovie(String movie) { this.movie = movie; }
    public String getRate() { return rate; }
    public void setRate(String rate) { this.rate = rate; }
    public String getTimeStamp() { return timeStamp; }
    public void setTimeStamp(String timeStamp) { this.timeStamp = timeStamp; }
    public String getUid() { return uid; }
    public void setUid(String uid) { this.uid = uid; }
}
4. Package the jar and upload it to the server where Hive runs.
[root@node03 lib]# rz -E
rz waiting to receive.
[root@node03 lib]# mv day06_hiveudf-1.0-SNAPSHOT.jar JsonParser.jar
[root@node03 lib]# pwd
/export/servers/hive-1.1.0-cdh5.14.0/lib
[root@node03 lib]#
5. Run this command to add the jar to the classpath:
0: jdbc:hive2://node03.hadoop.com:10000> add jar /export/servers/hive-1.1.0-cdh5.14.0/lib/JsonParser.jar;
6. Create the function:
0: jdbc:hive2://node03.hadoop.com:10000> create temporary function parseJson as 'cn.itcast.hive.udf.JsonParser';
7. Test it:
0: jdbc:hive2://node03.hadoop.com:10000> select parseJson(line) from t_json;
+---------------------+--+
| _c0 |
+---------------------+--+
| 2717,3,978298196,2 |
| 2571,4,978299773,2 |
| 1917,3,978300174,2 |
| 2396,4,978299641,2 |
| 3735,3,978298814,2 |
| 1953,4,978298775,2 |
| 1597,3,978300025,2 |
| 3809,3,978299712,2 |
| 1954,5,978298841,2 |
| 1955,4,978299200,2 |
| 235,3,978299351,2 |
| 1124,5,978299418,2 |
| 1957,5,978298750,2 |
| 163,4,978299809,2 |
+---------------------+--+
14 rows selected (0.358 seconds)
At this point the custom function has been created successfully.
III. Splitting and processing the data in Hive
Split on the commas:
0: jdbc:hive2://node03.hadoop.com:10000> select split(parseJson(line),',')[0] movie, split(parseJson(line),',')[1] rate, split(parseJson(line),',')[2] `timestamp`, split(parseJson(line),',')[3] uid from t_json;
+--------+-------+------------+------+--+
| movie  | rate  | timestamp  | uid  |
+--------+-------+------------+------+--+
| 2717 | 3 | 978298196 | 2 |
| 2571 | 4 | 978299773 | 2 |
| 1917 | 3 | 978300174 | 2 |
| 2396 | 4 | 978299641 | 2 |
| 3735 | 3 | 978298814 | 2 |
| 1953 | 4 | 978298775 | 2 |
| 1597 | 3 | 978300025 | 2 |
| 3809 | 3 | 978299712 | 2 |
| 1954 | 5 | 978298841 | 2 |
| 1955 | 4 | 978299200 | 2 |
| 235 | 3 | 978299351 | 2 |
| 1124 | 5 | 978299418 | 2 |
| 1957 | 5 | 978298750 | 2 |
| 163 | 4 | 978299809 | 2 |
+--------+-------+------------+-----+--+
14 rows selected (0.064 seconds)
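Note that the query above is written to invoke `parseJson` four times per row. A hypothetical alternative is to parse each line once in a subquery and split the cached result in the outer query:

```sql
SELECT split(parsed, ',')[0] AS movie,
       split(parsed, ',')[1] AS rate,
       split(parsed, ',')[2] AS `timestamp`,
       split(parsed, ',')[3] AS uid
FROM (SELECT parseJson(line) AS parsed FROM t_json) t;
```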
Save the query result into a Hive table:
0: jdbc:hive2://node03.hadoop.com:10000> create table t_rate as select split(parseJson(line),',')[0] movie, split(parseJson(line),',')[1] rate, split(parseJson(line),',')[2] `timestamp`, split(parseJson(line),',')[3] uid from t_json;
No rows affected (13.179 seconds)
0: jdbc:hive2://node03.hadoop.com:10000> select * from t_rate;
+---------------+--------------+-------------------+-------------+--+
| t_rate.movie  | t_rate.rate  | t_rate.timestamp  | t_rate.uid  |
+---------------+--------------+-------------------+-------------+--+
| 2717 | 3 | 978298196 | 2 |
| 2571 | 4 | 978299773 | 2 |
| 1917 | 3 | 978300174 | 2 |
| 2396 | 4 | 978299641 | 2 |
| 3735 | 3 | 978298814 | 2 |
| 1953 | 4 | 978298775 | 2 |
| 1597 | 3 | 978300025 | 2 |
| 3809 | 3 | 978299712 | 2 |
| 1954 | 5 | 978298841 | 2 |
| 1955 | 4 | 978299200 | 2 |
| 235 | 3 | 978299351 | 2 |
| 1124 | 5 | 978299418 | 2 |
| 1957 | 5 | 978298750 | 2 |
| 163 | 4 | 978299809 | 2 |
+---------------+--------------+-------------------+------------+--+
14 rows selected (0.094 seconds)
The above solves the problem with a custom Hive function. Hive actually has a built-in function for this as well; here is a demonstration:
0: jdbc:hive2://node03.hadoop.com:10000> select get_json_object(line,'$.movie') as movie from t_json limit 10;
+--------+--+
| movie |
+--------+--+
| 2717 |
| 2571 |
| 1917 |
| 2396 |
| 3735 |
| 1953 |
| 1597 |
| 3809 |
| 1954 |
| 1955 |
+--------+--+
10 rows selected (0.149 seconds)
The built-in function only handles relatively simple JSON structures; for complex JSON formats you still need to write a custom function.
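For extracting several keys at once there is also the built-in `json_tuple` UDTF, which pulls multiple fields out of each line in a single pass. A sketch against the same `t_json` table (the output column names here are arbitrary aliases):

```sql
SELECT b.movie, b.rate, b.ts, b.uid
FROM t_json a
LATERAL VIEW json_tuple(a.line, 'movie', 'rate', 'timeStamp', 'uid') b
  AS movie, rate, ts, uid;
```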