Python Simulated Login in Practice: Scraping Site-Wide Table Data (Part 3)

I checked: there are 49,473 rows in total, which means it takes at least 49,473 POST requests to scrape all the data. Getting it purely by hand would mean roughly ten times that many clicks...

Now for the real thing: let's start crawling.
import json
import time

import pandas as pd
import pyodbc
import requests
from lxml import etree  # json/time/pandas/requests/etree were already used in the earlier parts of this series

with open("region_state.json") as json_file:
    region_state = json.load(json_file)

data = pd.read_csv('remain.csv')

# read the records that have already been scraped
cnxn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
                      'DBQ=./ctic_crm.accdb')
crsr = cnxn.cursor()
crsr.execute('select Year_, Region, State, County from ctic_crm')
done = crsr.fetchall()
done = pd.DataFrame([list(x) for x in done],
                    columns=['CRMSearchForm[year]',
                             'CRMSearchForm[region]',
                             'CRMSearchForm[state]',
                             'CRMSearchForm[county]'])
done['CRMSearchForm[year]'] = done['CRMSearchForm[year]'].astype('int64')
# the database stores full state names while the form expects abbreviations, so map them back
state2st = {y: x for z in region_state.values() for x, y in z.items()}
done['CRMSearchForm[state]'] = [state2st[x] for x in done['CRMSearchForm[state]']]

# exclude the records that have already been scraped
remain = data.append(done)
remain = remain.drop_duplicates(keep=False)
total = len(remain)
print(f'{total} left.\n')
del data

# %%
remain['CRMSearchForm[year]'] = remain['CRMSearchForm[year]'].astype('str')
columns = ['Crop',
           'Total_Planted_Acres',
           'Conservation_Tillage_No_Till',
           'Conservation_Tillage_Ridge_Till',
           'Conservation_Tillage_Mulch_Till',
           'Conservation_Tillage_Total',
           'Other_Tillage_Practices_Reduced_Till15_30_Residue',
           'Other_Tillage_Practices_Conventional_Till0_15_Residue']
fields = ['Year_', 'Units', 'Area', 'Region', 'State', 'County'] + columns
data = {'CRMSearchForm[format]': 'Acres',
        'CRMSearchForm[area]': 'County',
        'CRMSearchForm[crop_type]': 'All',
        'summary': 'county'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/74.0.3729.131 Safari/537.36',
           'Host': 'www.ctic.org',
           'Upgrade-Insecure-Requests': '1',
           'DNT': '1',
           'Connection': 'keep-alive'}
url = 'https://www.ctic.org/crm?tdsourcetag=s_pctim_aiomsg'
headers2 = headers.copy()
headers2.update({'Referer': url,
                 'Origin': 'https://www.ctic.org'})  # update() works in place, so no reassignment


def new():
    # open a fresh session and grab a new CSRF token from the search page
    session = requests.Session()
    response = session.get(url=url, headers=headers)
    html = etree.HTML(response.text)
    _csrf = html.xpath('/html/head/meta[3]/@content')[0]
    return session, _csrf


session, _csrf = new()

for _, row in remain.iterrows():
    temp = dict(row)
    data.update(temp)
    data.update({'_csrf': _csrf})
    while True:
        try:
            response = session.post(url, data=data, headers=headers2, timeout=15)
            break
        except Exception as e:
            # the connection drops at random intervals: wait, reopen a session, retry
            session.close()
            print(e)
            print('\nSleep 30s.\n')
            time.sleep(30)
            session, _csrf = new()
            data.update({'_csrf': _csrf})
    df = pd.read_html(response.text)[0].dropna(how='all')
    df.columns = columns
    df['Year_'] = int(temp['CRMSearchForm[year]'])
    df['Units'] = 'Acres'
    df['Area'] = 'County'
    df['Region'] = temp['CRMSearchForm[region]']
    df['State'] = region_state[temp['CRMSearchForm[region]']][temp['CRMSearchForm[state]']]
    df['County'] = temp['CRMSearchForm[county]']
    df = df.reindex(columns=fields)
    for record in df.itertuples(index=False):
        tuple_record = tuple(record)
        sql_insert = f'INSERT INTO ctic_crm VALUES {tuple_record}'
        sql_insert = sql_insert.replace(', nan,', ', null,')
        crsr.execute(sql_insert)
        crsr.commit()
    print(total, row.to_list())
    total -= 1
else:
    print('Done!')
    crsr.close()
    cnxn.close()

Note the try...except in the middle of the loop: a Connection aborted error shows up at unpredictable intervals, sometimes only once every 9,000 or so requests, sometimes on the very first one. That is also why I added the "read the already-scraped records" and "exclude the already-scraped records" steps. And because I was worried about being flagged as a crawler, I padded the headers a bit (which did not seem to help at all); on every disconnect the script sleeps 30 seconds and opens a brand-new session.
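If the append + drop_duplicates(keep=False) step above looks odd, it is simply a cheap way to compute "all parameter combinations minus the ones already in the database". Here is a minimal sketch with toy data (hypothetical column names, and pd.concat instead of the now-deprecated DataFrame.append):

import pandas as pd

# toy stand-ins for the full parameter list and the rows already scraped
full = pd.DataFrame({'year': [2017, 2017, 2018], 'county': ['A', 'B', 'A']})
done = pd.DataFrame({'year': [2017], 'county': ['A']})

# stacking the scraped rows onto the full list and dropping every duplicated
# row leaves only the combinations not yet scraped; this only works if each
# row of `done` also appears, with identical dtypes, in `full`
remain = pd.concat([full, done]).drop_duplicates(keep=False)
print(remain)  # -> (2017, 'B') and (2018, 'A')

The dtype caveat is exactly why the script converts the year column of done to int64 before combining, and back to str afterwards.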

Then I left the program running over the weekend, and the command line finally printed Done! A look in Access showed 816,288 records. My thought: next time, try multithreading (or multiprocessing) and a proxy pool.
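As a rough idea of what that multithreaded version could look like, here is a minimal sketch (not the code I actually ran) that gives each worker thread its own session via the new() helper above and wraps the POST-and-parse step in a hypothetical fetch_one(); it reuses data, url, headers2 and remain from the script:

import threading
from concurrent.futures import ThreadPoolExecutor

thread_local = threading.local()  # one Session and CSRF token per worker thread


def get_session():
    # lazily open a session for the current thread using new() defined above
    if not hasattr(thread_local, 'session'):
        thread_local.session, thread_local.csrf = new()
    return thread_local.session, thread_local.csrf


def fetch_one(params):
    # hypothetical wrapper: POST one parameter combination and return the parsed table
    session, csrf = get_session()
    payload = {**data, **params, '_csrf': csrf}
    response = session.post(url, data=payload, headers=headers2, timeout=15)
    return pd.read_html(response.text)[0].dropna(how='all')


rows = [dict(row) for _, row in remain.iterrows()]
with ThreadPoolExecutor(max_workers=8) as pool:
    for df in pool.map(fetch_one, rows):
        # keep the Access writes in this single thread: pyodbc connections
        # are not safely shared across threads
        pass

A proxy pool would slot into fetch_one() through the proxies= argument of session.post(); whether either change actually speeds things up depends on how aggressively the site throttles.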

