使用Zeppelin分析电子商务消费行为分析
任务描述需求概述
问题分析Customer表Transaction表Store表Review表数据结构
连接至Zeppelin1.数据获取Download Data
2.检查行数和header行Understand the Data
3.上传数据到HDFSUpload the file to HDFS
4.建表查表4.1 Clear all tables if exists4.2 Verify all Tables are Created
5.数据清洗5.1 Clean and Mask customer_details5.2 Clean transaction_details into partition table5.3 Clean store_review table
6.Customer分析6.1找出顾客最常用的信用卡6.2找出客户资料中排名前五的职位名称6.3在美国女性最常用的信用卡6.4按性别和国家进行客户统计
7.Transaction分析7.1计算每月总收入7.2计算每个季度的总收入7.3按年计算总收入7.4按工作日计算总收入7.5按时间段计算总收入(需要清理数据)7.6按时间段计算平均消费7.7按工作日计算平均消费7.8计算年、月、日的交易总数7.9找出交易量最大的10个客户7.10找出消费最多的前10位顾客7.11统计该期间交易数量最少的用户7.12计算每个季度的独立客户总数7.13计算每周的独立客户总数7.14计算整个活动客户平均花费的最大值7.15统计每月花费最多的客户7.16统计每月访问次数最多的客户7.17按总价找出最受欢迎的5种产品7.18根据购买频率找出最畅销的5种产品7.19根据客户数量找出最受欢迎的5种产品7.20验证前5个details
8.Store分析8.1按客流量找出最受欢迎的商店8.2根据顾客消费价格找出最受欢迎的商店8.3根据顾客交易情况找出最受欢迎的商店8.4根据商店和唯一的顾客id获取最受欢迎的产品8.5获取每个商店的员工与顾客比8.6按年和月计算每家店的收入8.7按店铺制作总收益饼图8.8找出每个商店最繁忙的时间段8.9找出每家店的忠实顾客8.10根据每位员工的最高收入找出明星商店
9.Review分析9.1在ext_store_review中找出存在冲突的交易映射关系9.2了解客户评价的覆盖率9.3根据评分了解客户的分布情况9.4根据交易了解客户的分布情况9.5客户给出的最佳评价是否总是同一家门店
电子商务消费行为分析文件 提取码:39r1
任务描述
需求概述
对某零售企业最近一年门店收集的数据进行数据分析
潜在客户画像用户消费统计门店的资源利用率消费的特征人群定位数据的可视化展现
问题分析
Customer表
customer_detailsdetails
customer_idInt,1-500first_namestringlast_namestringemailstring,such as willddy@gmail.comgenderstring,Male or Femaleaddressstringcountrystringlanguagestringjobstring,job title/positioncredit_typestring,credit card type。such as visacredit_nostrin,credit card number
问题:language字段数据存在错误
Transaction表
transaction_detailsdetails
transaction_idInt,1-1000customer_idInt,1-500store_idInt,1-5pricedecimal,such as 5.08productstring,things boughtdatestring,when to purchasetimestring,what time to purchase
问题:表中transa_id有重复,但数据有效,需要修复数据
Store表
store_detailsdetails
store_idInt,1-5store_namestringemployee_numberInt,在store有多少employee
Review表
store_reviewdetails
stransaction_idInt,1-8000store_idInt,1-5review_storeInt,1-5
问题:表中有效的score数据 表中有奖transaction_id映射到错误的store_id
数据结构
Customer表
customer_details
Transaction表
transaction_details
Store表
store_details
Review表
store_review
连接至Zeppelin
Zeppelin安装和配置 打开Zeppelin网站 导入电子商务消费行为分析数据及模板
1.数据获取
Download Data
%sh
--创建数据库
mkdir
/mnt
/hivetest
cd
/mnt
/hivetest
wget
-p https
://raw
.githubusercontent
.com
/datafibers
/big_data_training
/master
/data
/customer_details
.csv
wget
-p https
://raw
.githubusercontent
.com
/datafibers
/big_data_training
/master
/data
/transaction_details
.csv
wget
-p https
://raw
.githubusercontent
.com
/datafibers
/big_data_training
/master
/data
/store_details
.csv
wget
-p https
://raw
.githubusercontent
.com
/datafibers
/big_data_training
/master
/data
/store_review
.csv
2.检查行数和header行
Understand the Data
%sh
##
/tmp
/data
/
cd
/mnt
/hivetest
head
-2 customer_details
.csv
#head
-2 store_details
.csv
#head
-2 store_review
.csv
#head
-2 transaction_details
.csv
3.上传数据到HDFS
Upload the file to HDFS
%sh
hdfs dfs
-mkdir
-p
/apps
/shopping
/customer
hdfs dfs
-put
/mnt
/hivetest
/customer_details
.csv
/apps
/shopping
/customer
hdfs dfs
-mkdir
-p
/apps
/shopping
/transaction
hdfs dfs
-put
/mnt
/hivetest
/transaction_details
.csv
/apps
/shopping
/transaction
hdfs dfs
-mkdir
-p
/apps
/shopping
/store
hdfs dfs
-put
/mnt
/hivetest
/store_details
.csv
/apps
/shopping
/store
hdfs dfs
-mkdir
-p
/apps
/shopping
/review
hdfs dfs
-put
/mnt
/hivetest
/store_review
.csv
/apps
/shopping
/review
##
/tmp
/shopping
/data
/customer
/
##
/tmp
/shopping
/data
/transaction
/
##
/tmp
/shopping
/data
/store
/
##
/tmp
/shopping
/data
/review
/
4.建表查表
4.1 Clear all tables if exists
%hive
create database shopping
use shopping
%hive
--创建顾客表
drop table ext_transaction_details
%hive
create external table
if not exists
ext_customer_details(
customer_id string
,
first_name string
,
last_name string
,
email string
,
gender string
,
address string
,
country string
,
language string
,
job string
,
credit_type string
,
credit_no string
)
row format
SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored
as textfile
location
'/apps/shopping/customer'
%hive
--创建交易流水表
create external table
if not exists
ext_transaction_details(
transaction_id string
,
customer_id string
,
store_id string
,
price
decimal(10,2),
product string
,
purchase_date string
,
purchase_time string
)
row format serde
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"quoteChar"="\"",
"escapeChar"="\\")
stored
as textfile
location
'/apps/shopping/transaction'
%hive
--创建商店详情表
create external table
if not exists
ext_store_details(
store_id string
,
store_name string
,
employee_number int
)
row format serde
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored
as textfile
location
'/apps/shopping/store'
tblproperties("skip.head.line.count"="1")
%hive
--创建评价表
create external table
if not exists
ext_store_review(
stransaction_id string
,
store_id string
,
review_score int
)
row format serde
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"escapeChar"="\\")
stored
as textfile
location
'/apps/shopping/review'
4.2 Verify all Tables are Created
%hive
--select
* from ext_customer_details limit
20
--select distinct language
from ext_customer_details
--select
* from ext_transaction_details limit
20
select transaction_id
,count(transaction_id
) cnt
from ext_transaction_details group by transaction_id order by cnt desc
--select
* from ext_store_details limit
20
--select
* from ext_store_review limit
20
5.数据清洗
解决以下有问题的数据
对transaction_details中的重复数据生成新ID过滤掉store_review中没有评分的数据可以把清洗好的数据放到另一个表或者用View表示找出PII(personal information identification)或PCI(personal confidential information)数据进行加密或hash重新组织transaction数据按照日期YYYY-MM做分区
5.1 Clean and Mask customer_details
%hive
-- 敏感信息加密
-- drop view vm_customer_details
create view
if not exists vm_customer_details
as
select
customer_id
,
first_name
,
unbase64(last_name
) lastname
,
unbase64(email
) email
,
gender
,
unbase64(address
) address
,
country
,
language
,
job
,
credit_type
,
unbase64(credit_no
) credit_no
from
ext_customer_details
5.2 Clean transaction_details into partition table
%hive
-- 创建流水详情表
create table
if not exists
transaction_details
(
transaction_id string
,
customer_id string
,
store_id string
,
price
decimal(8,2),
product string
,
purchase_date date
,
purchase_time string
)
partitioned
by(purchase_month string
)
-- select transaction_id
,count(1) from ext_transaction_details group by transaction_id having
count(1)>1
-- select
* from ext_transaction_details where transaction_id
=8001
set hive
.exec
.dynamic
.partition
.mode
=nonstrict
-- 开启动态分区
-- 重写数据
with base
as (
select
transaction_id
,
customer_id
,
store_id
,
price
,
product
,
purchase_date
,
purchase_time
,
from_unixtime(unix_timestamp(purchase_date
,'yyyy-MM-dd'),'yyyy-MM') as purchase_month
,
row_number() over (partition by transaction_id order by store_id
) as rn
from ext_transaction_details
)
insert overwrite table transaction_details
partition(purchase_month
)
select
if(rn
=1,transaction_id
,concat_ws('-',transaction_id
,'_fix')) ,
customer_id
,
store_id
,
price
,
product
,
purchase_date
,
purchase_time
,
purchase_month
from base
-- 查看修复信息
select
* from transaction_details where transaction_id like
'%fix%'
5.3 Clean store_review table
%hive
create view
if not exists vw_store_review
as
select
transaction_id
,
review_score
from ext_store_review where review_score
<> ''
show tables
最终会有七个表
6.Customer分析
6.1找出顾客最常用的信用卡
%hive
select credit_type
,count(credit_type
) cnt
from customer_details group by credit_type order by cnt desc
6.2找出客户资料中排名前五的职位名称
%hive
select job
,count(1) as pn
from customer_details group by job
order by pn desc limit
5
6.3在美国女性最常用的信用卡
%hive
select credit_type
,count(credit_type
) cnt
from customer_details where gender
='Female' and country
='United States' group by credit_type order by cnt desc limit
10
6.4按性别和国家进行客户统计
%hive
select country
,gender
,count(1) cnt
from customer_details group by country
,gender
7.Transaction分析
7.1计算每月总收入
%hive
select month
,store_id
,sum(price
) total
from transaction_details group by store_id
,month
7.2计算每个季度的总收入
%hive
select
year(purchase_date
) year
,concat
("第",floor((month(purchase_date
)-1)/3)+1,"季度") season
,sum(price
)
from transaction_details group by
year(purchase_date
) ,
concat("第",floor((month(purchase_date
)-1)/3)+1,"季度") order by year desc
,season
7.3按年计算总收入
%hive
select
year(purchase_date
) yy
,sum(price
) toatal
from transaction_details group by
year(purchase_date
) order by yy desc
7.4按工作日计算总收入
%hive
select
dayofweek(purchase_date
) work_date
,sum(price
) from transaction_details
where
dayofweek(purchase_date
) between
1 and
5 group by
dayofweek(purchase_date
)
7.5按时间段计算总收入(需要清理数据)
%hive
-- 使用正则表达式清理数据然后使用
case when 分组查询
with
t1
as(
select
*, if(instr(purchase_time
,'PM')>0,
if(cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)+12>=24,
0,
cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)+12),
cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)) as timeTrans
from transaction_details
), t2
as(
select t1
.*,case when t1
.timeTrans
<=8 and t1
.timeTrans
>5 then
'early morning'
when t1
.timeTrans
<=11 and t1
.timeTrans
>8 then
'morning'
when t1
.timeTrans
<=13 and t1
.timeTrans
>11 then
'noon'
when t1
.timeTrans
<=18 and t1
.timeTrans
>13 then
'afternoon'
when t1
.timeTrans
<=22 and t1
.timeTrans
>18 then
'evening'
else 'night'
end
as timeSplit
from t1
)
select t2
.timeSplit
,sum(price
)
from t2
group by t2
.timeSplit
7.6按时间段计算平均消费
%hive
-- 使用正则表达式清理数据然后使用
case when 分组查询
with
t1
as(
select
*, if(instr(purchase_time
,'PM')>0,
if(cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)+12>=24,
0,
cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)+12),
cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)) as timeTrans
from transaction_details
), t2
as(
select t1
.*,case when t1
.timeTrans
<=8 and t1
.timeTrans
>5 then
'early morning'
when t1
.timeTrans
<=11 and t1
.timeTrans
>8 then
'morning'
when t1
.timeTrans
<=13 and t1
.timeTrans
>11 then
'noon'
when t1
.timeTrans
<=18 and t1
.timeTrans
>13 then
'afternoon'
when t1
.timeTrans
<=22 and t1
.timeTrans
>18 then
'evening'
else 'night'
end
as timeSplit
from t1
)
select t2
.timeSplit
,avg(price
)
from t2
group by t2
.timeSplit
7.7按工作日计算平均消费
%hive
select
dayofweek(purchase_date
) work_date
,avg(price
) from transaction_details
where
dayofweek(purchase_date
) between
1 and
5 group by
dayofweek(purchase_date
)
7.8计算年、月、日的交易总数
%hive
--select purchase_date
,count(1) from transaction_details group by purchase_date
--select
concat(year(purchase_date
),"-",month(purchase_date
)),count(1) from transaction_details group by
year(purchase_date
),month(purchase_date
)
--select
year(purchase_date
),count(1) from transaction_details group by
year(purchase_date
)\
select
count(1) over(partition by
year(purchase_date
)) year
,
count(1) over(partition by
year(purchase_date
),month(purchase_date
)) month
,
count(1) over(partition by
year(purchase_date
),month(purchase_date
),day(purchase_date
))day
from transaction_details
7.9找出交易量最大的10个客户
%hive
select customer_id
,count(transaction_id
)a
from transaction_details group by customer_id order by a desc limit
10
7.10找出消费最多的前10位顾客
%hive
select customer_id
,sum(price
)a
from transaction_details group by customer_id order by a desc limit
10
7.11统计该期间交易数量最少的用户
%hive
select customer_id
,count(transaction_id
)a
from transaction_details group by customer_id order by a limit
1
7.12计算每个季度的独立客户总数
%hive
select
year(purchase_date
) yy
,
concat("季度",floor((month(purchase_date
)-1)/3)+1) season
,
count(distinct customer_id
) from transaction_details group by
year(purchase_date
),
concat("季度",floor((month(purchase_date
)-1)/3)+1)
order by yy desc
,season
7.13计算每周的独立客户总数
%hive
select
concat(year(purchase_date
),'年第',weekofyear(purchase_date
),'周'),count(distinct customer_id
) from transaction_details group by
year(purchase_date
),weekofyear(purchase_date
)
7.14计算整个活动客户平均花费的最大值
%hive
select a
.customer_id
,max(a
.av
) from (select customer_id
,avg(price
)av
from
transaction_details group by customer_id
)a group by a
.customer_id
7.15统计每月花费最多的客户
%hive
select b
.m
,b
.id
,b
.s
from(
select a
.m
,a
.id
,a
.s
,row_number() over(partition by a
.m order by a
.s desc
) as win1
from(
select
concat(year(purchase_date
),'-',month(purchase_date
)) m
,customer_id id
,sum(price
) s
from transaction_details
group by
year(purchase_date
),month(purchase_date
),customer_id
)a
) b
where b
.win1
=1
7.16统计每月访问次数最多的客户
%hive
select b
.m
,b
.id
,b
.c
from(
select a
.m
,a
.id
,a
.c
,row_number() over(partition by a
.m order by a
.c desc
) as win1
from(
select
concat(year(purchase_date
),'-',month(purchase_date
)) m
,customer_id id
, count(1) c
from transaction_details
group by
year(purchase_date
),month(purchase_date
),customer_id
) a
) b
where b
.win1
=1
7.17按总价找出最受欢迎的5种产品
%hive
select product
,sum(price
)a
from transaction_details group by product order by a desc limit
5
7.18根据购买频率找出最畅销的5种产品
%hive
select product
,count(transaction_id
)a
from transaction_details group by product order by a desc limit
5
7.19根据客户数量找出最受欢迎的5种产品
%hive
select product
,count(distinct customer_id
)a
from transaction_details group by product order by a desc limit
5
7.20验证前5个details
%hive
select
* from transaction_details where product
in ('Goat - Whole Cut')
8.Store分析
8.1按客流量找出最受欢迎的商店
%hive
select store_id
,count(transaction_id
) a
from transaction_details group by store_id order by a desc limit
1
8.2根据顾客消费价格找出最受欢迎的商店
%hive
select store_id
,sum(price
)a
from transaction_details group by store_id order by a desc limit
1
8.3根据顾客交易情况找出最受欢迎的商店
%hive
select store_id
,count(transaction_id
)a
,sum(price
)b
from transaction_details group by store_id order by a desc
,b desc limit
1
8.4根据商店和唯一的顾客id获取最受欢迎的产品
%hive
select b
.store_id
,b
.product
from (
select a
.store_id
,a
.product
,a
.c
,row_number() over(partition by store_id order by a
.c desc
)as win1
from(
select store_id
,product
,count(distinct customer_id
) c
from transaction_details
group by store_id
,product
) a
)b
where b
.win1
=1
8.5获取每个商店的员工与顾客比
%hive
select a
.store_id
,concat_ws(':',cast(ceil(round(s
.employee_number
/a
.c
*100))as string
),'100')
from(
select t
.store_id
,count(distinct customer_id
) c
from transaction_details t
group by t
.store_id
)a join ext_store_details s
on a
.store_id
=s
.store_id
8.6按年和月计算每家店的收入
%hive
select store_id
,year(purchase_date
),month(purchase_date
), sum(price
) from transaction_details group by store_id
,year(purchase_date
),month(purchase_date
)
8.7按店铺制作总收益饼图
%hive
select store_id
,sum(price
)a
from transaction_details group by store_id
8.8找出每个商店最繁忙的时间段
%hive
with
t1
as(
select
*, if(instr(purchase_time
,'PM')>0,
if(cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)+12>=24,
0,
cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)+12),
cast(regexp_extract(purchase_time
,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int
)) as timeTrans
from transaction_details
), t2
as(
select t1
.*,case when t1
.timeTrans
<=8 and t1
.timeTrans
>5 then
'early morning'
when t1
.timeTrans
<=11 and t1
.timeTrans
>8 then
'morning'
when t1
.timeTrans
<=13 and t1
.timeTrans
>11 then
'noon'
when t1
.timeTrans
<=18 and t1
.timeTrans
>13 then
'afternoon'
when t1
.timeTrans
<=22 and t1
.timeTrans
>18 then
'evening'
else 'night'
end
as timeSplit
from t1
),
t3
as(
select t2
.store_id
,t2
.timeSplit
,count(1) c
from t2
group by t2
.store_id
,t2
.timeSplit
),
t4
as(
select t3
.store_id
,t3
.timeSplit
,row_number() over(partition by store_id order by t3
.timeSplit desc
)as win1
from t3
)
select t4
.store_id
,t4
.timeSplit
from t4
where t4
.win1
=1
8.9找出每家店的忠实顾客
%hive
select b
.* from (select store_id
,customer_id
,count(transaction_id
)a
from transaction_details group by store_id
,customer_id
)b where b
.a
>=5
8.10根据每位员工的最高收入找出明星商店
%hive
-- 求总收入与雇员比值的最大值
with
t1
as (
select store_id
,sum(price
) s
from transaction_details
group by store_id
)
select t1
.store_id
,t1
.s
/s
.employee_number ss
from t1 join ext_store_details s on s
.store_id
= t1
.store_id
order by ss desc
limit
1
9.Review分析
9.1在ext_store_review中找出存在冲突的交易映射关系
%hive
select stransaction_id
from store_review
group by stransaction_id
having
count(1)>1
9.2了解客户评价的覆盖率
%hive
-- 求各个店共有多少顾客评价
with
t1
as(
select t2
.store_id
,t1
.transaction_id
,t2
.customer_id
from store_review t1 join transaction_details t2 on t1
.transaction_id
=t2
.transaction_id
)
select t1
.store_id
,count(distinct t1
.customer_id
)
from t1
group by t1
.store_id
9.3根据评分了解客户的分布情况
%hive
-- 求每家店每个评分有多少个客户给的
with
t1
as(
select t2
.store_id
,t1
.review_score
,t2
.customer_id
from store_review t1 join transaction_details t2 on t1
.transaction_id
=t2
.transaction_id
)
select t1
.store_id
,t1
.review_score
,count(distinct customer_id
)
from t1
group by t1
.store_id
,t1
.review_score
9.4根据交易了解客户的分布情况
%hive
-- 求每家店每个客户的订单数
select store_id
,customer_id
,count(1)
from transaction_details
group by store_id
,customer_id
9.5客户给出的最佳评价是否总是同一家门店
%hive
with yes
as(select customer_id
,count(distinct s
.store_id
)cnt
,"yes" from store_review s join transaction_details on
substr(transaction_id
,0,length(transaction_id
)-2)=stransaction_id where
substr(transaction_id
,length(transaction_id
),length(transaction_id
))<>'2' group by customer_id having cnt
=1),
no
as(select customer_id
,count(distinct s
.store_id
)cnt
,"no" from store_review s join transaction_details on
substr(transaction_id
,0,length(transaction_id
)-2)=stransaction_id where
substr(transaction_id
,length(transaction_id
),length(transaction_id
))<>'2' group by customer_id having cnt
>1)
select
* from yes union all select
* from no