1. Data Collection

Collection target: the World Cup results summary table, WorldCupsSummary.

The table summarizes all 21 World Cup tournaments (1930-2018): host country, top four teams, number of participating teams, total goals scored, total attendance, and so on. It contains the following fields: Year (year held), HostCountry (host country), Winner (champion), Second (runner-up), Third (third place), Fourth (fourth place), GoalsScored (total goals scored), QualifiedTeams (number of participating teams), MatchesPlayed (total matches played), Attendance (total attendance), HostContinent (continent of the host country), and WinnerContinent (continent of the winning team). The collected fields are shown in the screenshot below.
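To make the target schema concrete, here is a minimal sketch of loading the finished table with pandas and inspecting a few of these fields; the file name and encoding assume the CSV produced at the end of this section, and the column names follow the field list above:

import pandas as pd

# Load the summary table produced by the crawler below
wc = pd.read_csv('WorldCupsSummary.csv', encoding='gb18030')
print(wc.columns.tolist())                                   # all collected fields
print(wc[['Year', 'HostCountry', 'Winner', 'GoalsScored']].head())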

Collection workflow:

Step 1: import the required libraries, such as requests, json, and pandas. The code is shown below:

import requests
import json
import time
import pandas as pd
import random

Step 2: determine the pagination method, the URL, and the request parameters for issuing requests and receiving responses. The code is shown below:

df = []
i = 0
# API endpoint
url = 'https://api.bilibili.com/x/space/arc/search?'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 Edg/92.0.902.67',
    'referer': 'https://www.bilibili.com/',
}
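The params variable used by the request in the next step is never defined in the excerpt. A minimal sketch of what it might contain, assuming this endpoint's usual query keys (mid = user id, pn = page number, ps = page size; all three names are assumptions); pn is the value that changes from page to page when paginating:

# Hypothetical query parameters; the key names are assumptions
params = {
    'mid': 1935882,  # target user id
    'pn': 1,         # page number, incremented to paginate
    'ps': 25,        # results per page
}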

Step 3: issue the request and confirm that the site responds normally and can be scraped. The code is shown below:

response = requests.get(url, headers=headers, params=params)
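The excerpt shows the request but not the actual check. One minimal way to confirm the response, assuming the usual JSON envelope of this API in which a code field of 0 indicates success:

# HTTP 200 means the request went through; code == 0 in the JSON body
# indicates the API call itself succeeded (assumed response envelope)
print(response.status_code)
print(response.json().get('code'))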

Step 4: define a list and send requests with the requests library. Since we need to collect data from several pages, we exploit the fact that each page's URL differs slightly to crawl page by page, then locate and extract the target content from each record in the list. The code is shown below:
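The loop below iterates over info_list, which is never defined in the excerpt. A minimal sketch of how it might be extracted from the JSON response; the path data → list → vlist follows this endpoint's usual layout, but the exact structure depends on the actual payload:

# Hypothetical extraction of the record list from the JSON payload
data = response.json()
info_list = data.get('data', {}).get('list', {}).get('vlist', [])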

for value in info_list:
    # Read each field, falling back to an empty string when it is missing
    RoundID = value.get('RoundID', '')
    MatchID = value.get('MatchID', '')
    Team_Initials = value.get('Team_Initials', '')
    Coach_Name = value.get('Coach_Name', '')
    Line_up = value.get('Line_up', '')
    Shirt_Number = value.get('Shirt_Number', '')
    Player_Name = value.get('Player_Name', '')
    Position = value.get('Position', '')
    Event = value.get('Event', '')
    df.append([RoundID, MatchID, Team_Initials, Coach_Name, Line_up,
               Shirt_Number, Player_Name, Position, Event])

Step 5: add a timing call to slow the code down, lowering the page-turn rate so the crawler mimics manual paging and does not request pages too quickly. The code is shown below:

# Sleep 1-2 seconds between requests to avoid crawling too fast
time.sleep(random.randint(1, 2))

Step 6: once the site is confirmed to be scrapable, fetch the information in a for loop, then write the results out and save them to a CSV file. The code is shown below; a sketch of the get_info helper it calls follows after the block.

if __name__ == '__main__':
    start = time.time()
    # Number of pages to crawl per user
    num = 3
    # Target user ids
    mids = [1935882, 351498044, 390461123, 99157282, 60384544, 9008159, 8960728, 71851529]
    for value in mids:
        get_info(url, headers, num, value)
    end = time.time()
    print(f'Crawl finished in {end - start:.1f} s')
    # Data persistence
    df = pd.DataFrame(df, columns=["RoundID", "MatchID", "Team Initials", "Coach Name",
                                   "Line-up", "Shirt Number", "Player Name", "Position", "Event"])
    df.to_csv('WorldCupsSummary.csv', encoding='gb18030', index=False)
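The get_info function called above is referenced but never defined in the excerpt. A minimal sketch of what it might look like, assembled from steps 2 through 5; the query-parameter names and the JSON path are assumptions carried over from the sketches above:

def get_info(url, headers, num, mid):
    # Crawl num pages for one user id, appending one row per record to the global df list
    for pn in range(1, num + 1):
        params = {'mid': mid, 'pn': pn, 'ps': 25}  # assumed query keys
        response = requests.get(url, headers=headers, params=params)
        info_list = response.json().get('data', {}).get('list', {}).get('vlist', [])  # assumed path
        for value in info_list:
            df.append([value.get(k, '') for k in (
                'RoundID', 'MatchID', 'Team_Initials', 'Coach_Name', 'Line_up',
                'Shirt_Number', 'Player_Name', 'Position', 'Event')])
        # Throttle between pages
        time.sleep(random.randint(1, 2))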

In this way, after running the complete script, the collected information can be found in the CSV file and analyzed. A screenshot of part of the collected data is shown below:

2. Data Preprocessing

2.1 Goals of data preprocessing:

(1) Data cleaning

(2) Missing-value handling