大数据——项目实战(使用Zeppelin分析电子商务消费行为分析)

    科技2022-07-15  116

    使用Zeppelin分析电子商务消费行为分析

    任务描述需求概述 问题分析Customer表Transaction表Store表Review表数据结构 连接至Zeppelin1.数据获取Download Data 2.检查行数和header行Understand the Data 3.上传数据到HDFSUpload the file to HDFS 4.建表查表4.1 Clear all tables if exists4.2 Verify all Tables are Created 5.数据清洗5.1 Clean and Mask customer_details5.2 Clean transaction_details into partition table5.3 Clean store_review table 6.Customer分析6.1找出顾客最常用的信用卡6.2找出客户资料中排名前五的职位名称6.3在美国女性最常用的信用卡6.4按性别和国家进行客户统计 7.Transaction分析7.1计算每月总收入7.2计算每个季度的总收入7.3按年计算总收入7.4按工作日计算总收入7.5按时间段计算总收入(需要清理数据)7.6按时间段计算平均消费7.7按工作日计算平均消费7.8计算年、月、日的交易总数7.9找出交易量最大的10个客户7.10找出消费最多的前10位顾客7.11统计该期间交易数量最少的用户7.12计算每个季度的独立客户总数7.13计算每周的独立客户总数7.14计算整个活动客户平均花费的最大值7.15统计每月花费最多的客户7.16统计每月访问次数最多的客户7.17按总价找出最受欢迎的5种产品7.18根据购买频率找出最畅销的5种产品7.19根据客户数量找出最受欢迎的5种产品7.20验证前5个details 8.Store分析8.1按客流量找出最受欢迎的商店8.2根据顾客消费价格找出最受欢迎的商店8.3根据顾客交易情况找出最受欢迎的商店8.4根据商店和唯一的顾客id获取最受欢迎的产品8.5获取每个商店的员工与顾客比8.6按年和月计算每家店的收入8.7按店铺制作总收益饼图8.8找出每个商店最繁忙的时间段8.9找出每家店的忠实顾客8.10根据每位员工的最高收入找出明星商店 9.Review分析9.1在ext_store_review中找出存在冲突的交易映射关系9.2了解客户评价的覆盖率9.3根据评分了解客户的分布情况9.4根据交易了解客户的分布情况9.5客户给出的最佳评价是否总是同一家门店

    电子商务消费行为分析文件 提取码:39r1

    任务描述

    需求概述

    对某零售企业最近一年门店收集的数据进行数据分析

    潜在客户画像用户消费统计门店的资源利用率消费的特征人群定位数据的可视化展现

    问题分析

    Customer表

    customer_detailsdetailscustomer_idInt,1-500first_namestringlast_namestringemailstring,such as willddy@gmail.comgenderstring,Male or Femaleaddressstringcountrystringlanguagestringjobstring,job title/positioncredit_typestring,credit card type。such as visacredit_nostrin,credit card number

    问题:language字段数据存在错误

    Transaction表

    transaction_detailsdetailstransaction_idInt,1-1000customer_idInt,1-500store_idInt,1-5pricedecimal,such as 5.08productstring,things boughtdatestring,when to purchasetimestring,what time to purchase

    问题:表中transa_id有重复,但数据有效,需要修复数据

    Store表

    store_detailsdetailsstore_idInt,1-5store_namestringemployee_numberInt,在store有多少employee

    Review表

    store_reviewdetailsstransaction_idInt,1-8000store_idInt,1-5review_storeInt,1-5

    问题:表中有效的score数据 表中有奖transaction_id映射到错误的store_id

    数据结构

    Customer表

    customer_details

    Transaction表

    transaction_details

    Store表

    store_details

    Review表

    store_review

    连接至Zeppelin

    Zeppelin安装和配置 打开Zeppelin网站 导入电子商务消费行为分析数据及模板

    1.数据获取

    Download Data

    %sh --创建数据库 mkdir /mnt/hivetest cd /mnt/hivetest wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/customer_details.csv wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/transaction_details.csv wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/store_details.csv wget -p https://raw.githubusercontent.com/datafibers/big_data_training/master/data/store_review.csv

    2.检查行数和header行

    Understand the Data

    %sh ## /tmp/data/ cd /mnt/hivetest head -2 customer_details.csv #head -2 store_details.csv #head -2 store_review.csv #head -2 transaction_details.csv

    3.上传数据到HDFS

    Upload the file to HDFS

    %sh hdfs dfs -mkdir -p /apps/shopping/customer hdfs dfs -put /mnt/hivetest/customer_details.csv /apps/shopping/customer hdfs dfs -mkdir -p /apps/shopping/transaction hdfs dfs -put /mnt/hivetest/transaction_details.csv /apps/shopping/transaction hdfs dfs -mkdir -p /apps/shopping/store hdfs dfs -put /mnt/hivetest/store_details.csv /apps/shopping/store hdfs dfs -mkdir -p /apps/shopping/review hdfs dfs -put /mnt/hivetest/store_review.csv /apps/shopping/review ##/tmp/shopping/data/customer/ ##/tmp/shopping/data/transaction/ ##/tmp/shopping/data/store/ ##/tmp/shopping/data/review/

    4.建表查表

    4.1 Clear all tables if exists

    %hive create database shopping use shopping %hive --创建顾客表 drop table ext_transaction_details %hive create external table if not exists ext_customer_details( customer_id string, first_name string, last_name string, email string, gender string, address string, country string, language string, job string, credit_type string, credit_no string) row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties( "separatorChar"=",", "escapeChar"="\\") stored as textfile location '/apps/shopping/customer' %hive --创建交易流水表 create external table if not exists ext_transaction_details( transaction_id string, customer_id string, store_id string, price decimal(10,2), product string, purchase_date string, purchase_time string) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties( "separatorChar"=",", "quoteChar"="\"", "escapeChar"="\\") stored as textfile location '/apps/shopping/transaction' %hive --创建商店详情表 create external table if not exists ext_store_details( store_id string, store_name string, employee_number int) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties( "separatorChar"=",", "escapeChar"="\\") stored as textfile location '/apps/shopping/store' tblproperties("skip.head.line.count"="1") %hive --创建评价表 create external table if not exists ext_store_review( stransaction_id string, store_id string, review_score int) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' with serdeproperties( "separatorChar"=",", "escapeChar"="\\") stored as textfile location '/apps/shopping/review'

    4.2 Verify all Tables are Created

    %hive --select * from ext_customer_details limit 20 --select distinct language from ext_customer_details --select * from ext_transaction_details limit 20 select transaction_id,count(transaction_id) cnt from ext_transaction_details group by transaction_id order by cnt desc --select * from ext_store_details limit 20 --select * from ext_store_review limit 20

    5.数据清洗

    解决以下有问题的数据

    对transaction_details中的重复数据生成新ID过滤掉store_review中没有评分的数据可以把清洗好的数据放到另一个表或者用View表示找出PII(personal information identification)或PCI(personal confidential information)数据进行加密或hash重新组织transaction数据按照日期YYYY-MM做分区

    5.1 Clean and Mask customer_details

    %hive -- 敏感信息加密 -- drop view vm_customer_details create view if not exists vm_customer_details as select customer_id , first_name , unbase64(last_name) lastname, unbase64(email) email, gender , unbase64(address) address, country , language, job , credit_type , unbase64(credit_no) credit_no from ext_customer_details

    5.2 Clean transaction_details into partition table

    %hive -- 创建流水详情表 create table if not exists transaction_details ( transaction_id string, customer_id string, store_id string, price decimal(8,2), product string, purchase_date date, purchase_time string ) partitioned by(purchase_month string) -- select transaction_id,count(1) from ext_transaction_details group by transaction_id having count(1)>1 -- select * from ext_transaction_details where transaction_id=8001 set hive.exec.dynamic.partition.mode=nonstrict -- 开启动态分区 -- 重写数据 with base as ( select transaction_id, customer_id , store_id , price , product, purchase_date, purchase_time, from_unixtime(unix_timestamp(purchase_date,'yyyy-MM-dd'),'yyyy-MM') as purchase_month, row_number() over (partition by transaction_id order by store_id) as rn from ext_transaction_details ) insert overwrite table transaction_details partition(purchase_month) select if(rn=1,transaction_id,concat_ws('-',transaction_id,'_fix')) , customer_id , store_id , price , product, purchase_date , purchase_time, purchase_month from base -- 查看修复信息 select * from transaction_details where transaction_id like '%fix%'

    5.3 Clean store_review table

    %hive create view if not exists vw_store_review as select transaction_id, review_score from ext_store_review where review_score <> '' show tables

    最终会有七个表

    6.Customer分析

    6.1找出顾客最常用的信用卡

    %hive select credit_type,count(credit_type) cnt from customer_details group by credit_type order by cnt desc

    6.2找出客户资料中排名前五的职位名称

    %hive select job ,count(1) as pn from customer_details group by job order by pn desc limit 5

    6.3在美国女性最常用的信用卡

    %hive select credit_type,count(credit_type) cnt from customer_details where gender='Female' and country='United States' group by credit_type order by cnt desc limit 10

    6.4按性别和国家进行客户统计

    %hive select country,gender,count(1) cnt from customer_details group by country,gender

    7.Transaction分析

    7.1计算每月总收入

    %hive select month,store_id,sum(price) total from transaction_details group by store_id,month

    7.2计算每个季度的总收入

    %hive select year(purchase_date) year,concat ("第",floor((month(purchase_date)-1)/3)+1,"季度") season,sum(price) from transaction_details group by year(purchase_date) , concat("第",floor((month(purchase_date)-1)/3)+1,"季度") order by year desc,season

    7.3按年计算总收入

    %hive select year(purchase_date) yy,sum(price) toatal from transaction_details group by year(purchase_date) order by yy desc

    7.4按工作日计算总收入

    %hive select dayofweek(purchase_date) work_date,sum(price) from transaction_details where dayofweek(purchase_date) between 1 and 5 group by dayofweek(purchase_date)

    7.5按时间段计算总收入(需要清理数据)

    %hive -- 使用正则表达式清理数据然后使用case when 分组查询 with t1 as( select *, if(instr(purchase_time,'PM')>0, if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24, 0, cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12), cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans from transaction_details), t2 as( select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning' when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning' when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon' when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon' when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening' else 'night' end as timeSplit from t1) select t2.timeSplit,sum(price) from t2 group by t2.timeSplit

    7.6按时间段计算平均消费

    %hive -- 使用正则表达式清理数据然后使用case when 分组查询 with t1 as( select *, if(instr(purchase_time,'PM')>0, if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24, 0, cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12), cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans from transaction_details), t2 as( select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning' when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning' when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon' when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon' when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening' else 'night' end as timeSplit from t1) select t2.timeSplit,avg(price) from t2 group by t2.timeSplit

    7.7按工作日计算平均消费

    %hive select dayofweek(purchase_date) work_date,avg(price) from transaction_details where dayofweek(purchase_date) between 1 and 5 group by dayofweek(purchase_date)

    7.8计算年、月、日的交易总数

    %hive --select purchase_date,count(1) from transaction_details group by purchase_date --select concat(year(purchase_date),"-",month(purchase_date)),count(1) from transaction_details group by year(purchase_date),month(purchase_date) --select year(purchase_date),count(1) from transaction_details group by year(purchase_date)\ select count(1) over(partition by year(purchase_date)) year, count(1) over(partition by year(purchase_date),month(purchase_date)) month, count(1) over(partition by year(purchase_date),month(purchase_date),day(purchase_date))day from transaction_details

    7.9找出交易量最大的10个客户

    %hive select customer_id,count(transaction_id)a from transaction_details group by customer_id order by a desc limit 10

    7.10找出消费最多的前10位顾客

    %hive select customer_id,sum(price)a from transaction_details group by customer_id order by a desc limit 10

    7.11统计该期间交易数量最少的用户

    %hive select customer_id,count(transaction_id)a from transaction_details group by customer_id order by a limit 1

    7.12计算每个季度的独立客户总数

    %hive select year(purchase_date) yy, concat("季度",floor((month(purchase_date)-1)/3)+1) season, count(distinct customer_id) from transaction_details group by year(purchase_date), concat("季度",floor((month(purchase_date)-1)/3)+1) order by yy desc,season

    7.13计算每周的独立客户总数

    %hive select concat(year(purchase_date),'年第',weekofyear(purchase_date),'周'),count(distinct customer_id) from transaction_details group by year(purchase_date),weekofyear(purchase_date)

    7.14计算整个活动客户平均花费的最大值

    %hive select a.customer_id,max(a.av) from (select customer_id,avg(price)av from transaction_details group by customer_id)a group by a.customer_id

    7.15统计每月花费最多的客户

    %hive select b.m,b.id,b.s from( select a.m,a.id,a.s ,row_number() over(partition by a.m order by a.s desc) as win1 from( select concat(year(purchase_date),'-',month(purchase_date)) m,customer_id id,sum(price) s from transaction_details group by year(purchase_date),month(purchase_date),customer_id)a) b where b.win1=1

    7.16统计每月访问次数最多的客户

    %hive select b.m,b.id,b.c from( select a.m,a.id,a.c,row_number() over(partition by a.m order by a.c desc) as win1 from( select concat(year(purchase_date),'-',month(purchase_date)) m,customer_id id, count(1) c from transaction_details group by year(purchase_date),month(purchase_date),customer_id) a) b where b.win1=1

    7.17按总价找出最受欢迎的5种产品

    %hive select product,sum(price)a from transaction_details group by product order by a desc limit 5

    7.18根据购买频率找出最畅销的5种产品

    %hive select product,count(transaction_id)a from transaction_details group by product order by a desc limit 5

    7.19根据客户数量找出最受欢迎的5种产品

    %hive select product,count(distinct customer_id)a from transaction_details group by product order by a desc limit 5

    7.20验证前5个details

    %hive select * from transaction_details where product in ('Goat - Whole Cut')

    8.Store分析

    8.1按客流量找出最受欢迎的商店

    %hive select store_id,count(transaction_id) a from transaction_details group by store_id order by a desc limit 1

    8.2根据顾客消费价格找出最受欢迎的商店

    %hive select store_id,sum(price)a from transaction_details group by store_id order by a desc limit 1

    8.3根据顾客交易情况找出最受欢迎的商店

    %hive select store_id,count(transaction_id)a,sum(price)b from transaction_details group by store_id order by a desc,b desc limit 1

    8.4根据商店和唯一的顾客id获取最受欢迎的产品

    %hive select b.store_id,b.product from ( select a.store_id,a.product,a.c ,row_number() over(partition by store_id order by a.c desc )as win1 from( select store_id,product,count(distinct customer_id) c from transaction_details group by store_id,product) a )b where b.win1 =1

    8.5获取每个商店的员工与顾客比

    %hive select a.store_id,concat_ws(':',cast(ceil(round(s.employee_number/a.c*100))as string),'100') from( select t.store_id,count(distinct customer_id) c from transaction_details t group by t.store_id)a join ext_store_details s on a.store_id=s.store_id

    8.6按年和月计算每家店的收入

    %hive select store_id,year(purchase_date),month(purchase_date), sum(price) from transaction_details group by store_id,year(purchase_date),month(purchase_date)

    8.7按店铺制作总收益饼图

    %hive select store_id,sum(price)a from transaction_details group by store_id

    8.8找出每个商店最繁忙的时间段

    %hive with t1 as( select *, if(instr(purchase_time,'PM')>0, if(cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12>=24, 0, cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)+12), cast(regexp_extract(purchase_time,'([0-9]{1,2}):([0-9]{2}\\w*)',1)as int)) as timeTrans from transaction_details), t2 as( select t1.*,case when t1.timeTrans<=8 and t1.timeTrans>5 then 'early morning' when t1.timeTrans<=11 and t1.timeTrans>8 then 'morning' when t1.timeTrans<=13 and t1.timeTrans>11 then 'noon' when t1.timeTrans<=18 and t1.timeTrans>13 then 'afternoon' when t1.timeTrans<=22 and t1.timeTrans>18 then 'evening' else 'night' end as timeSplit from t1), t3 as( select t2.store_id,t2.timeSplit,count(1) c from t2 group by t2.store_id,t2.timeSplit), t4 as( select t3.store_id,t3.timeSplit,row_number() over(partition by store_id order by t3.timeSplit desc)as win1 from t3 ) select t4.store_id,t4.timeSplit from t4 where t4.win1=1

    8.9找出每家店的忠实顾客

    %hive select b.* from (select store_id,customer_id,count(transaction_id)a from transaction_details group by store_id,customer_id)b where b.a>=5

    8.10根据每位员工的最高收入找出明星商店

    %hive -- 求总收入与雇员比值的最大值 with t1 as ( select store_id,sum(price) s from transaction_details group by store_id) select t1.store_id,t1.s/s.employee_number ss from t1 join ext_store_details s on s.store_id= t1.store_id order by ss desc limit 1

    9.Review分析

    9.1在ext_store_review中找出存在冲突的交易映射关系

    %hive select stransaction_id from store_review group by stransaction_id having count(1)>1

    9.2了解客户评价的覆盖率

    %hive -- 求各个店共有多少顾客评价 with t1 as( select t2.store_id,t1.transaction_id,t2.customer_id from store_review t1 join transaction_details t2 on t1.transaction_id=t2.transaction_id) select t1.store_id,count(distinct t1.customer_id) from t1 group by t1.store_id

    9.3根据评分了解客户的分布情况

    %hive -- 求每家店每个评分有多少个客户给的 with t1 as( select t2.store_id ,t1.review_score,t2.customer_id from store_review t1 join transaction_details t2 on t1.transaction_id=t2.transaction_id) select t1.store_id,t1.review_score,count(distinct customer_id) from t1 group by t1.store_id,t1.review_score

    9.4根据交易了解客户的分布情况

    %hive -- 求每家店每个客户的订单数 select store_id,customer_id,count(1) from transaction_details group by store_id,customer_id

    9.5客户给出的最佳评价是否总是同一家门店

    %hive with yes as(select customer_id,count(distinct s.store_id)cnt,"yes" from store_review s join transaction_details on substr(transaction_id,0,length(transaction_id)-2)=stransaction_id where substr(transaction_id,length(transaction_id),length(transaction_id))<>'2' group by customer_id having cnt=1), no as(select customer_id,count(distinct s.store_id)cnt,"no" from store_review s join transaction_details on substr(transaction_id,0,length(transaction_id)-2)=stransaction_id where substr(transaction_id,length(transaction_id),length(transaction_id))<>'2' group by customer_id having cnt>1) select * from yes union all select * from no

    Processed: 0.014, SQL: 8