① SparkSQL is compatible with Hive's metastore, Hive user-defined functions (UDFs), and Hive SerDes (serialization/deserialization).
② SparkSQL can therefore replace Hive outright, or Hive's execution engine can be switched from MapReduce to SparkSQL.
③ spark-sql must be started in client mode; cluster mode is not supported.
The deploy mode can be specified at startup; an incorrect example:
bin/spark-sql --deploy-mode cluster --master spark://linux02:7077 --driver-class-path /jars/mysql-connector-java-5.1.49.jar
The command above fails with the following error:
Cluster deploy mode is not applicable to Spark SQL shell.
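For contrast, simply omitting --deploy-mode cluster (client mode is the default) lets the shell start, e.g.:
bin/spark-sql --master spark://linux02:7077 --driver-class-path /jars/mysql-connector-java-5.1.49.jar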
The integration steps are as follows:
1. Install MySQL, create a regular user, and grant it privileges:
(1) CREATE USER 'hive'@'%' IDENTIFIED BY '123456';
Notes:
① % means the user may connect from any IP address; to restrict access to a subnet, replace % with a pattern in the 192.168.xxx style (example below).
② 'hive' is the username; any name works (pick one that is self-explanatory).
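For example, limiting the account to a subnet instead of % might look like this (the subnet shown is purely illustrative):
CREATE USER 'hive'@'192.168.133.%' IDENTIFIED BY '123456';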
Caution:
If the password is too simple, the CREATE USER statement above fails with:
ERROR 1819 (HY000): Your password does not satisfy the current policy requirements
(2) Lower MySQL's password policy and minimum password length (not something to do in a production environment):
set global validate_password_policy=LOW;
set global validate_password_length=6;
(3) Give the hive user privileges on the tables under hivedb (and no other database):
GRANT ALL PRIVILEGES ON hivedb.* TO 'hive'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
FLUSH PRIVILEGES;
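To double-check what the new account can access, MySQL's SHOW GRANTS statement can be used (purely an optional verification step):
SHOW GRANTS FOR 'hive'@'%';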
2. Add the hive-site.xml configuration file
Note: the file goes into the conf directory under the Spark installation directory. Its contents are as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.
  See the NOTICE file distributed with this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software distributed under the License
  is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
  either express or implied. See the License for the specific language governing permissions
  and limitations under the License.
-->
<!-- Place this file in Spark's conf directory on the machine where spark-sql will be started -->
<configuration>
  <!-- MySQL connection; if the hivedb database does not exist, it is created automatically -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://linux02:3306/hivedb?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <!-- JDBC driver class for MySQL -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <!-- Use the dedicated hive user rather than root: easier to manage and less risky -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <!-- Connection password -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
  </property>
</configuration>
3. Upload the MySQL JDBC driver and start spark-sql:
(1) Upload a MySQL JDBC driver. It can be placed in the jars directory of the Spark installation, or in any other directory of your choice; in the latter case, point spark-sql at the driver with --driver-class-path when starting it.
Here, all JAR files that need an explicit path are kept in the /jars directory; create it first if it does not exist: mkdir /jars
bin/spark-sql --master spark://linux02:7077,node-5:7077 --driver-class-path /jars/mysql-connector-java-5.1.49.jar
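Once the shell is up, a quick way to confirm that the metastore connection works is to list the databases (just a sanity check):
show databases;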
4. Connect to MySQL:
(1) Start Navicat and connect to MySQL: connection name/host is linux02, username hive, password 123456;
Note: the username must not be root; root is limited to connecting from a fixed address, whereas hive can connect from any address.
(2) Fix the location stored in the DBS table:
SparkSQL automatically creates a database in MySQL (named hivedb). The DB_LOCATION_URI column of its DBS table must be changed by hand to an HDFS address (required for managed/internal tables; external tables can specify a location when they are created, as sketched at the end of step 6 below).
Default value: file:/opt/apps/spark-3.0.1-bin-hadoop3.2/bin/spark-warehouse
New value: hdfs://linux01:8020/user/hive/warehouse
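If you prefer SQL to editing the row in Navicat, the same change can be made directly against the metastore database (a sketch; WHERE DB_ID=1 assumes the default database row, so check the table first):
UPDATE hivedb.DBS SET DB_LOCATION_URI='hdfs://linux01:8020/user/hive/warehouse' WHERE DB_ID=1;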
Caution:
① After changing the location, refresh the view in Navicat (refresh button at the bottom left) and restart spark-sql;
② the host in the new URI must be the NameNode's hostname, not the host where spark-sql is started; otherwise the following error occurs:
20/10/05 17:50:23 ERROR SparkSQLDriver: Failed in [create table t_user3 (id int,name string) row format delimited fields terminated by ',']
java.io.EOFException: End of File Exception between local host is: "linux02/192.168.133.4"; destination host is: "linux02":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:789)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
    at org.apache.hadoop.ipc.Client.call(Client.java:1457)
    at org.apache.hadoop.ipc.Client.call(Client.java:1367)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy27.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:903)
    ......
5. Restart the spark-sql shell:
bin/spark-sql --master spark://linux02:7077 --driver-class-path /root/mysql-connector-java-5.1.49.jar
6. Use spark-sql to create tables and perform other operations:
① Create a table:
create table t_user3 (id bigint,name string,age int,fv double) row format delimited fields terminated by ',';
② Prepare the data and load it into the table:
load data local inpath '/doit17/user/txt' into table t_user3;
③ Query the data in the table:
select id,name,age,fv from t_user3 order by fv desc;
④ Check the data in HDFS (on the NameNode node):
hdfs dfs -ls /user/hive/warehouse
hdfs dfs -ls /user/hive/warehouse/t_user3    (if data shows up here, the load has reached HDFS)
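As noted in step 4, an external table does not depend on DB_LOCATION_URI because its location is given at creation time; a minimal sketch (table name and path are illustrative):
create external table t_user_ext (id bigint,name string,age int,fv double) row format delimited fields terminated by ',' location 'hdfs://linux01:8020/data/user_ext';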
1. Benefits of Spark's HiveServer2 (Thrift server):
It runs as a long-lived service that is accessed through beeline, and any JDBC-capable tool can connect to it; for example, other machines can connect via beeline.
2. Start HiveServer2. Before starting it, start the Hive metastore service first: hive --service metastore;
(1)sbin/start-thriftserver.sh --master spark://linux02:7077 --executor-memory 1g --total-executor-cores 4
After it starts, an application (spark-submit) appears, but it shuts itself down after a short while.
(2) The startup above is incomplete because the MySQL JDBC driver was not specified; add it with --driver-class-path:
sbin/start-thriftserver.sh --master spark://linux02:7077 --executor-memory 1g --total-executor-cores 4 --driver-class-path /jars/mysql-connector-java-5.1.49.jar
This launches a spark-submit application whose role here is HiveServer2.
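To stop the service later, Spark provides a matching stop script in the same directory:
sbin/stop-thriftserver.sh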
(3) Connect with beeline from the bin directory:
bin/beeline -u jdbc:hive2://linux02:10000
Here, -u specifies the JDBC URL.
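Once connected, the same SQL works as in the spark-sql shell. For a quick non-interactive check, beeline can also run a single statement with -e and take a username with -n (the username here is only an example):
bin/beeline -u jdbc:hive2://linux02:10000 -n root -e "show databases;"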
(4)spark-sql-hive-server2:
