Hive Custom UDF for JSON Data Parsing (Assignment)


     

    Reference article: https://www.pianshen.com/article/931082274/

    What I did: followed the reference article step by step and wrote out the complete code.

     

    JSON Parsing UDF Development (Assignment)

    Assignment:

    The raw JSON data looks like this:

    {"movie":"1193","rate":"5","timeStamp":"978300760","uid":"1"}

    {"movie":"661","rate":"3","timeStamp":"978302109","uid":"1"}

    {"movie":"914","rate":"3","timeStamp":"978301968","uid":"1"}

    {"movie":"3408","rate":"4","timeStamp":"978300275","uid":"1"}

    {"movie":"2355","rate":"5","timeStamp":"978824291","uid":"1"}

    {"movie":"1197","rate":"3","timeStamp":"978302268","uid":"1"}

    {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}

     

    The data needs to be imported into the Hive data warehouse.

    Requirement: it doesn't matter how many intermediate tables you use; the final deliverable is a result table of this shape:

    movie    rate    timestamp    uid
    1197     3       978302268    1

     

     

    The teacher's suggested approach:

    Note: everything is done inside Hive; a custom function may be used.

    Step 1: write a custom UDF that parses each JSON record into its four fields and joins them into a single tab-separated line.

     

    Step 2: register the custom function.

    Step 3: create a staging table and load the JSON data into a single string column of that table.

    Step 4: use insert overwrite local directory to query the staging table through the custom function and write the result out to a local path.

    Step 5: use load data to load that output into the final table (steps 4 and 5 are sketched just below).
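
    As a rough sketch of steps 4 and 5 in HiveQL (the output directory and the final table name here are illustrative, not part of the assignment):

    -- Step 4 (sketch): run the staging table through the UDF and dump the
    -- result to a local directory; each output row is one delimited line.
    insert overwrite local directory '/export/datas/hivedata/rate_out'
    select parseJson(line) from t_json;

    -- Step 5 (sketch): load that output into the final table. The field
    -- terminator must match whatever delimiter the UDF emits
    -- (',' in the code shown later in Part 2).
    create table t_rate_loaded(movie string, rate string, ts string, uid string)
    row format delimited fields terminated by ',';
    load data local inpath '/export/datas/hivedata/rate_out' into table t_rate_loaded;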

     

    Student solution:

    Part 1: Data (do not get this step wrong; each JSON record must sit on its own line):

    {"movie":"2717","rate":"3","timeStamp":"978298196","uid":"2"} {"movie":"2571","rate":"4","timeStamp":"978299773","uid":"2"} {"movie":"1917","rate":"3","timeStamp":"978300174","uid":"2"} {"movie":"2396","rate":"4","timeStamp":"978299641","uid":"2"} {"movie":"3735","rate":"3","timeStamp":"978298814","uid":"2"} {"movie":"1953","rate":"4","timeStamp":"978298775","uid":"2"} {"movie":"1597","rate":"3","timeStamp":"978300025","uid":"2"} {"movie":"3809","rate":"3","timeStamp":"978299712","uid":"2"} {"movie":"1954","rate":"5","timeStamp":"978298841","uid":"2"} {"movie":"1955","rate":"4","timeStamp":"978299200","uid":"2"} {"movie":"235","rate":"3","timeStamp":"978299351","uid":"2"} {"movie":"1124","rate":"5","timeStamp":"978299418","uid":"2"} {"movie":"1957","rate":"5","timeStamp":"978298750","uid":"2"} {"movie":"163","rate":"4","timeStamp":"978299809","uid":"2"}  

    # Create the t_json table

    0: jdbc:hive2://localhost:10000> create table t_json(line string) row format delimited;

    # Load the data

    0: jdbc:hive2://localhost:10000> load data local inpath '/export/datas/hivedata/data.json' into table t_json;

    Result after loading:

    0: jdbc:hive2://node03.hadoop.com:10000> select * from t_json;

    +----------------------------------------------------+--+

    |                    t_json.line                     |

    +----------------------------------------------------+--+

    | {"movie":"2717","rate":"3","timeStamp":"978298196","uid":"2"} |

    | {"movie":"2571","rate":"4","timeStamp":"978299773","uid":"2"} |

    | {"movie":"1917","rate":"3","timeStamp":"978300174","uid":"2"} |

    | {"movie":"2396","rate":"4","timeStamp":"978299641","uid":"2"} |

    | {"movie":"3735","rate":"3","timeStamp":"978298814","uid":"2"} |

    | {"movie":"1953","rate":"4","timeStamp":"978298775","uid":"2"} |

    | {"movie":"1597","rate":"3","timeStamp":"978300025","uid":"2"} |

    | {"movie":"3809","rate":"3","timeStamp":"978299712","uid":"2"} |

    | {"movie":"1954","rate":"5","timeStamp":"978298841","uid":"2"} |

    | {"movie":"1955","rate":"4","timeStamp":"978299200","uid":"2"} |

    | {"movie":"235","rate":"3","timeStamp":"978299351","uid":"2"} |

    | {"movie":"1124","rate":"5","timeStamp":"978299418","uid":"2"} |

    | {"movie":"1957","rate":"5","timeStamp":"978298750","uid":"2"} |

    | {"movie":"163","rate":"4","timeStamp":"978299809","uid":"2"} |

    +----------------------------------------------------+--+

    14 rows selected (0.047 seconds)

    Verify that the data meets expectations:

    0: jdbc:hive2://node03.hadoop.com:10000> select get_json_object(line,'$.movie') as movie from t_json limit 10;

    +--------+--+

    | movie  |

    +--------+--+

    | 2717   |

    | 2571   |

    | 1917   |

    | 2396   |

    | 3735   |

    | 1953   |

    | 1597   |

    | 3809   |

    | 1954   |

    | 1955   |

    +--------+--+

    10 rows selected (0.05 seconds)

    0: jdbc:hive2://node03.hadoop.com:10000>

    Data initialization is now complete.

     

    Part 2: Custom UDF source code

    1. Create a Java project (not covered in detail here). Once the project is created, add the following POM dependencies, or copy the Hive dependency JARs straight into the project:

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
        <parent>
            <artifactId>bigdata19</artifactId>
            <groupId>cn.itcast.bigdata19</groupId>
            <version>1.0-SNAPSHOT</version>
        </parent>
        <modelVersion>4.0.0</modelVersion>
        <artifactId>day06_hiveudf</artifactId>

        <!-- repositories tells Maven which repository to download JARs from -->
        <repositories>
            <repository>
                <id>cloudera</id>
                <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
            </repository>
        </repositories>

        <dependencies>
            <!-- https://mvnrepository.com/artifact/org.json/json -->
            <dependency>
                <groupId>org.json</groupId>
                <artifactId>json</artifactId>
                <version>20190722</version>
            </dependency>
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-common</artifactId>
                <version>2.8.0</version>
                <!--<version>2.6.0-cdh5.14.0</version>-->
            </dependency>
            <dependency>
                <groupId>org.apache.hive</groupId>
                <artifactId>hive-exec</artifactId>
                <version>1.2.1</version>
                <!--<version>1.1.0-cdh5.14.0</version>-->
                <!-- scope's main values:
                     compile:  needed to compile and included in the package (the default)
                     test:     needed only for tests, not packaged
                     provided: needed during development, not packaged -->
                <!--<scope>provided</scope>-->
            </dependency>
        </dependencies>

        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.0</version>
                    <configuration>
                        <source>1.8</source>
                        <target>1.8</target>
                        <encoding>UTF-8</encoding>
                    </configuration>
                </plugin>
                <!-- Shade plugin: bundles all the dependencies we use into one JAR -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-shade-plugin</artifactId>
                    <version>2.2</version>
                    <executions>
                        <execution>
                            <phase>package</phase>
                            <goals>
                                <goal>shade</goal>
                            </goals>
                            <configuration>
                                <filters>
                                    <filter>
                                        <artifact>*:*</artifact>
                                        <excludes>
                                            <exclude>META-INF/*.SF</exclude>
                                            <exclude>META-INF/*.DSA</exclude>
                                            <exclude>META-INF/*.RSA</exclude>
                                        </excludes>
                                    </filter>
                                </filters>
                            </configuration>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </build>
    </project>
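
    One note on the scope choice above: with the shade plugin, mvn package produces a fat JAR containing every compile-scoped dependency. Since the JAR is uploaded into Hive's own lib directory below, marking hive-exec (and hadoop-common) as provided would keep it out of the fat JAR and shrink the upload; the server already has those classes.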

     

    2. Create the JsonParser class:

    package cn.itcast.hive.udf;

    import org.apache.hadoop.hive.ql.exec.UDF;
    import parquet.org.codehaus.jackson.map.ObjectMapper;

    import java.io.IOException;

    public class JsonParser extends UDF {

        // Jackson ObjectMapper (the copy bundled under the parquet package in hive-exec)
        ObjectMapper objectMapper = new ObjectMapper();

        public String evaluate(String jsonLine) {
            if (null == jsonLine) {
                return null;
            }
            if (jsonLine.trim().isEmpty()) {
                return "";
            }
            MovieRateBean readValue;
            try {
                // Map the JSON record onto the bean; toString() joins the four fields
                readValue = objectMapper.readValue(jsonLine, MovieRateBean.class);
                return readValue.toString();
            } catch (IOException e) {
                e.printStackTrace();
            }
            return "";
        }

        // Quick local test: {"movie":"1287","rate":"5","timeStamp":"978302039","uid":"1"}
        public static void main(String[] args) {
            String jsonStr = "{\"movie\":\"1287\",\"rate\":\"5\",\"timeStamp\":\"978302039\",\"uid\":\"1\"}";
            JsonParser jsonParser = new JsonParser();
            String result = jsonParser.evaluate(jsonStr);
            System.out.println(result);
        }
    }
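
    Running main() prints 1287,5,978302039,1, so the parse can be verified locally before the JAR is deployed to the server.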

    3. Create the bean class:

    package cn.itcast.hive.udf;

    public class MovieRateBean {

        private String movie;
        private String rate;
        private String timeStamp;
        private String uid;

        @Override
        public String toString() {
            return (movie + ',' + rate + ',' + timeStamp + ',' + uid);
        }

        public String getMovie() {
            return movie;
        }

        public void setMovie(String movie) {
            this.movie = movie;
        }

        public String getRate() {
            return rate;
        }

        public void setRate(String rate) {
            this.rate = rate;
        }

        public String getTimeStamp() {
            return timeStamp;
        }

        public void setTimeStamp(String timeStamp) {
            this.timeStamp = timeStamp;
        }

        public String getUid() {
            return uid;
        }

        public void setUid(String uid) {
            this.uid = uid;
        }
    }
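
    Note: the teacher's outline called for a tab-separated line, but MovieRateBean.toString() joins the four fields with commas, which is why the queries in Part 3 split on ','. Either delimiter works, as long as the split (or the load) uses the same one.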

    4. Package the project (e.g. with mvn package) and upload the JAR to the server where Hive runs:

    [root@node03 lib]# rz -E

    rz waiting to receive.

    [root@node03 lib]# mv day06_hiveudf-1.0-SNAPSHOT.jar JsonParser.jar

    [root@node03 lib]# pwd

    /export/servers/hive-1.1.0-cdh5.14.0/lib

    [root@node03 lib]#

    5. Add the JAR to Hive's classpath:

    0: jdbc:hive2://node03.hadoop.com:10000> add jar /export/servers/hive-1.1.0-cdh5.14.0/lib/JsonParser.jar;

    6. Create the function:

    0: jdbc:hive2://node03.hadoop.com:10000> create temporary function parseJson as 'cn.itcast.hive.udf.JsonParser';

    7. Test it:

    0: jdbc:hive2://node03.hadoop.com:10000> select parseJson(line) from t_json;

    +---------------------+--+

    |         _c0         |

    +---------------------+--+

    | 2717,3,978298196,2  |

    | 2571,4,978299773,2  |

    | 1917,3,978300174,2  |

    | 2396,4,978299641,2  |

    | 3735,3,978298814,2  |

    | 1953,4,978298775,2  |

    | 1597,3,978300025,2  |

    | 3809,3,978299712,2  |

    | 1954,5,978298841,2  |

    | 1955,4,978299200,2  |

    | 235,3,978299351,2   |

    | 1124,5,978299418,2  |

    | 1957,5,978298750,2  |

    | 163,4,978299809,2   |

    +---------------------+--+

    14 rows selected (0.358 seconds)

    At this point, the custom function has been created successfully.
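
    One caveat: create temporary function lives only for the current session. On Hive 0.13 and later, a permanent function can be registered instead; a minimal sketch, assuming JsonParser.jar has first been uploaded to HDFS (the HDFS path below is illustrative):

    -- Permanent function: survives across sessions (Hive 0.13+)
    create function parseJson as 'cn.itcast.hive.udf.JsonParser'
    using jar 'hdfs:///user/hive/jars/JsonParser.jar';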

     

    Part 3: Splitting and processing the data in Hive

    Split on the commas:

    0: jdbc:hive2://node03.hadoop.com:10000> select split(parseJson(line),',')[0] movie, split(parseJson(line),',')[1] rate, split(parseJson(line),',')[2] timeStamp, split(parseJson(line),',')[3] id from t_json;

    +--------+-------+------------+-----+--+

    | movie  | rate  | timestamp  | id  |

    +--------+-------+------------+-----+--+

    | 2717   | 3     | 978298196  | 2   |

    | 2571   | 4     | 978299773  | 2   |

    | 1917   | 3     | 978300174  | 2   |

    | 2396   | 4     | 978299641  | 2   |

    | 3735   | 3     | 978298814  | 2   |

    | 1953   | 4     | 978298775  | 2   |

    | 1597   | 3     | 978300025  | 2   |

    | 3809   | 3     | 978299712  | 2   |

    | 1954   | 5     | 978298841  | 2   |

    | 1955   | 4     | 978299200  | 2   |

    | 235    | 3     | 978299351  | 2   |

    | 1124   | 5     | 978299418  | 2   |

    | 1957   | 5     | 978298750  | 2   |

    | 163    | 4     | 978299809  | 2   |

    +--------+-------+------------+-----+--+

    14 rows selected (0.064 seconds)
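
    One refinement worth noting: the query above writes parseJson(line) four times. Wrapping the UDF call in a subquery keeps it to a single call per row in the query text (and, depending on the optimizer, in the plan); a minimal sketch:

    select split(parsed, ',')[0] as movie,
           split(parsed, ',')[1] as rate,
           split(parsed, ',')[2] as ts,
           split(parsed, ',')[3] as uid
    from (select parseJson(line) as parsed from t_json) t;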

     

     

    Save the query result into a Hive table:

    0: jdbc:hive2://node03.hadoop.com:10000> create table t_rate as select split(parseJson(line),',')[0] movie, split(parseJson(line),',')[1] rate, split(parseJson(line),',')[2] timeStamp, split(parseJson(line),',')[3] id from t_json;

    No rows affected (13.179 seconds)

    0: jdbc:hive2://node03.hadoop.com:10000> select * from t_rate;

    +---------------+--------------+-------------------+------------+--+

    | t_rate.movie  | t_rate.rate  | t_rate.timestamp  | t_rate.id  |

    +---------------+--------------+-------------------+------------+--+

    | 2717          | 3            | 978298196         | 2          |

    | 2571          | 4            | 978299773         | 2          |

    | 1917          | 3            | 978300174         | 2          |

    | 2396          | 4            | 978299641         | 2          |

    | 3735          | 3            | 978298814         | 2          |

    | 1953          | 4            | 978298775         | 2          |

    | 1597          | 3            | 978300025         | 2          |

    | 3809          | 3            | 978299712         | 2          |

    | 1954          | 5            | 978298841         | 2          |

    | 1955          | 4            | 978299200         | 2          |

    | 235           | 3            | 978299351         | 2          |

    | 1124          | 5            | 978299418         | 2          |

    | 1957          | 5            | 978298750         | 2          |

    | 163           | 4            | 978299809         | 2          |

    +---------------+--------------+-------------------+------------+--+

    14 rows selected (0.094 seconds)

     

     

     

    The above solves the problem with a custom Hive function. Hive actually has a built-in function for this as well; here is a quick demonstration:

    0: jdbc:hive2://node03.hadoop.com:10000> select get_json_object(line,'$.movie') as movie from t_json limit 10;

    +--------+--+

    | movie  |

    +--------+--+

    | 2717   |

    | 2571   |

    | 1917   |

    | 2396   |

    | 3735   |

    | 1953   |

    | 1597   |

    | 3809   |

    | 1954   |

    | 1955   |

    +--------+--+

    10 rows selected (0.149 seconds)

    0: jdbc:hive2://node03.hadoop.com:10000>

     

     

    The built-in function handles simple JSON structures; for complex JSON formats, a custom function is still needed.
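
    For flat records like these, the built-in json_tuple UDTF can also pull all four fields in a single call, with no custom code; a minimal sketch (ts is used as the alias since timestamp is a Hive type keyword):

    -- json_tuple parses each line once and emits all requested keys
    select json_tuple(line, 'movie', 'rate', 'timeStamp', 'uid')
           as (movie, rate, ts, uid)
    from t_json;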

     
