Skip to content

Commit f47823e

Browse files
authored
Create 8、Hive实战:统计影音视频网站的常规指标.md
1 parent 9e2b206 commit f47823e

File tree

1 file changed

+314
-0
lines changed

1 file changed

+314
-0
lines changed
Lines changed: 314 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,314 @@
1+
## Hive实战:统计影音视频网站的常规指标
2+
3+
### 1、需求描述
4+
5+
统计硅谷影音视频网站的常规指标,各种TopN指标:
6+
7+
--统计视频观看数Top10
8+
9+
--统计视频类别热度Top10
10+
11+
--统计视频观看数Top20所属类别
12+
13+
--统计视频观看数Top50所关联视频的所属类别Rank
14+
15+
--统计每个类别中的视频热度Top10
16+
17+
--统计每个类别视频观看数Top10
18+
19+
### 2、数据结构
20+
21+
表:视频表
22+
23+
| 字段 | 备注 | 详细描述 |
24+
| ----------- | --------------------------- | ---------------------- |
25+
| video id | 视频唯一id(String) | 11位字符串 |
26+
| uploader | 视频上传者(String) | 上传视频的用户名String |
27+
| age | 视频年龄(int) | 视频在平台上的整数天 |
28+
| category | 视频类别(Array<String>) | 上传视频指定的视频分类 |
29+
| length | 视频长度(Int) | 整形数字标识的视频长度 |
30+
| views | 观看次数(Int) | 视频被浏览的次数 |
31+
| rate | 视频评分(Double) | 满分5分 |
32+
| Ratings | 流量(Int) | 视频的流量,整型数字 |
33+
| conments | 评论数(Int) | 一个视频的整数评论数 |
34+
| related ids | 相关视频id(Array<String>) | 相关视频的id,最多20个 |
35+
36+
### 3、准备工作
37+
38+
(1)创建外部表:gulivideo_ori
39+
40+
```sql
41+
create external table video_ori(
42+
videoId string,
43+
uploader string,
44+
age int,
45+
category array<string>,
46+
length int,
47+
views int,
48+
rate float,
49+
ratings int,
50+
comments int,
51+
relatedId array<string>)
52+
row format delimited fields terminated by "\t"
53+
collection items terminated by "&"
54+
location '/gulivideo/video';
55+
```
56+
57+
(2)然后建立内部表
58+
59+
```sql
60+
create table video_orc(
61+
videoId string,
62+
uploader string,
63+
age int,
64+
category array<string>,
65+
length int,
66+
views int,
67+
rate float,
68+
ratings int,
69+
comments int,
70+
relatedId array<string>)
71+
stored as orc
72+
tblproperties("orc.compress"="SNAPPY");
73+
```
74+
75+
(3)向ORC表插入数据
76+
77+
```sql
78+
insert into table video_orc select * from video_ori;
79+
```
80+
81+
### 4、业务分析
82+
83+
#### 4.1 统计视频观看数Top10
84+
85+
思路:使用order by按照views字段做一个全局排序即可,同时我们设置只显示前10条。
86+
87+
最终代码:
88+
89+
```sql
90+
SELECT
91+
videoid,
92+
views
93+
FROM
94+
video_orc
95+
ORDER BY
96+
views DESC
97+
LIMIT 10;
98+
```
99+
100+
#### 4.2 统计视频类别热度Top10
101+
102+
思路:
103+
104+
(1)即统计每个类别有多少个视频,显示出包含视频最多的前10个类别。
105+
106+
(2)我们需要按照类别group by聚合,然后count组内的videoId个数即可。
107+
108+
(3)因为当前表结构为:一个视频对应一个或多个类别。所以如果要group by类别,需要先将类别进行列转行(展开),然后再进行count即可。
109+
110+
(4)最后按照热度排序,显示前10条。
111+
112+
最终代码:
113+
114+
```sql
115+
SELECT
116+
cate,
117+
COUNT(videoid) n
118+
FROM
119+
(
120+
SELECT
121+
videoid,
122+
cate
123+
FROM
124+
video_orc LATERAL VIEW explode(category) tbl as cate) t1
125+
GROUP BY
126+
cate
127+
ORDER BY
128+
n desc
129+
limit 10;
130+
```
131+
132+
#### 4.3 统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
133+
134+
思路:
135+
136+
(1)先找到观看数最高的20个视频所属条目的所有信息,降序排列
137+
138+
(2)把这20条信息中的category分裂出来(列转行)
139+
140+
(3)最后查询视频分类名称和该分类下有多少个Top20的视频
141+
142+
最终代码:
143+
144+
```sql
145+
SELECT
146+
cate,
147+
COUNT(videoid) n
148+
FROM
149+
(
150+
SELECT
151+
videoid,
152+
cate
153+
FROM
154+
(
155+
SELECT
156+
videoid,
157+
views,
158+
category
159+
FROM
160+
video_orc
161+
ORDER BY
162+
views DESC
163+
LIMIT 20 ) t1 LATERAL VIEW explode(category) tbl as cate ) t2
164+
GROUP BY
165+
cate
166+
ORDER BY
167+
n DESC;
168+
```
169+
170+
#### 4.4 统计视频观看数Top50所关联视频的所属类别排序
171+
172+
代码:
173+
174+
```sql
175+
SELECT
176+
DISTINCT t4.cate,
177+
t5.n
178+
FROM
179+
(
180+
SELECT
181+
explode(category) cate
182+
FROM
183+
(
184+
SELECT
185+
DISTINCT t2.videoid,
186+
v.category
187+
FROM
188+
(
189+
SELECT
190+
explode(relatedid) videoid
191+
FROM
192+
(
193+
SELECT
194+
videoid,
195+
views,
196+
relatedid
197+
FROM
198+
video_orc
199+
ORDER BY
200+
views DESC
201+
LIMIT 50 ) t1 ) t2
202+
JOIN video_orc v on
203+
t2.videoid = v.videoid ) t3 ) t4
204+
JOIN (
205+
SELECT
206+
cate,
207+
COUNT(videoid) n
208+
FROM
209+
(
210+
SELECT
211+
videoid,
212+
cate
213+
FROM
214+
video_orc LATERAL VIEW explode(category) tbl as cate) g1
215+
GROUP BY
216+
cate ) t5 ON
217+
t4.cate = t5.cate
218+
ORDER BY
219+
t5.n DESC;
220+
```
221+
222+
#### 4.5 统计每个类别中的视频热度Top10,以Music为例
223+
224+
思路:
225+
226+
(1)要想统计Music类别中的视频热度Top10,需要先找到Music类别,那么就需要将category展开,所以可以创建一张表用于存放categoryId展开的数据。
227+
228+
(2)向category展开的表中插入数据。
229+
230+
(3)统计对应类别(Music)中的视频热度。
231+
232+
最终代码:
233+
234+
创建表类别表:
235+
236+
```sql
237+
create table gulivideo_category(
238+
videoId string,
239+
uploader string,
240+
age int,
241+
categoryId string,
242+
length int,
243+
views int,
244+
rate float,
245+
ratings int,
246+
comments int,
247+
relatedId array<string>)
248+
row format delimited
249+
fields terminated by "\t"
250+
collection items terminated by "&"
251+
stored as orc;
252+
```
253+
254+
向类别表中插入数据:
255+
256+
```sql
257+
insert into table gulivideo_category
258+
select
259+
videoId,
260+
uploader,
261+
age,
262+
categoryId,
263+
length,
264+
views,
265+
rate,
266+
ratings,
267+
comments,
268+
relatedId
269+
from
270+
gulivideo_orc lateral view explode(category) catetory as categoryId;
271+
```
272+
273+
统计Music类别的Top10(也可以统计其他)
274+
275+
```sql
276+
select
277+
videoId,
278+
views
279+
from
280+
gulivideo_category
281+
where
282+
categoryId = "Music"
283+
order by
284+
views
285+
desc limit
286+
10;
287+
```
288+
289+
#### 4.6 统计每个类别视频观看数Top10
290+
291+
最终代码:
292+
293+
```sql
294+
SELECT
295+
cate,
296+
videoid,
297+
views
298+
FROM
299+
(
300+
SELECT
301+
cate,
302+
videoid,
303+
views,
304+
RANK() OVER(PARTITION BY cate
305+
ORDER BY
306+
views DESC) hot
307+
FROM
308+
video_category ) t1
309+
WHERE
310+
hot <= 10;
311+
```
312+
313+
314+

0 commit comments

Comments
 (0)