Python Simulated Login in Practice: Scraping Site-Wide Table Data (Part 3)

I checked: there are 49,473 rows in total, which means it takes at least 49,473 POST requests to scrape all the data. Getting it purely by hand would mean roughly ten times that many clicks...

Now for the real thing: let's start crawling.
import json
import time

import pandas as pd
import pyodbc
import requests
from lxml import etree  # json/time/pandas/requests/etree were already used in the earlier parts of this series

with open("region_state.json") as json_file:
    region_state = json.load(json_file)

data = pd.read_csv('remain.csv')

# read the records that have already been scraped
cnxn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
                      'DBQ=./ctic_crm.accdb')
crsr = cnxn.cursor()
crsr.execute('select Year_, Region, State, County from ctic_crm')
done = crsr.fetchall()
done = pd.DataFrame([list(x) for x in done],
                    columns=['CRMSearchForm[year]',
                             'CRMSearchForm[region]',
                             'CRMSearchForm[state]',
                             'CRMSearchForm[county]'])
done['CRMSearchForm[year]'] = done['CRMSearchForm[year]'].astype('int64')
# the database stores full state names while the form expects abbreviations, so map them back
state2st = {y: x for z in region_state.values() for x, y in z.items()}
done['CRMSearchForm[state]'] = [state2st[x] for x in done['CRMSearchForm[state]']]

# exclude the records that have already been scraped
remain = data.append(done)
remain = remain.drop_duplicates(keep=False)
total = len(remain)
print(f'{total} left.\n')
del data

# %%
remain['CRMSearchForm[year]'] = remain['CRMSearchForm[year]'].astype('str')
columns = ['Crop',
           'Total_Planted_Acres',
           'Conservation_Tillage_No_Till',
           'Conservation_Tillage_Ridge_Till',
           'Conservation_Tillage_Mulch_Till',
           'Conservation_Tillage_Total',
           'Other_Tillage_Practices_Reduced_Till15_30_Residue',
           'Other_Tillage_Practices_Conventional_Till0_15_Residue']
fields = ['Year_', 'Units', 'Area', 'Region', 'State', 'County'] + columns
data = {'CRMSearchForm[format]': 'Acres',
        'CRMSearchForm[area]': 'County',
        'CRMSearchForm[crop_type]': 'All',
        'summary': 'county'}
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/74.0.3729.131 Safari/537.36',
           'Host': 'www.ctic.org',
           'Upgrade-Insecure-Requests': '1',
           'DNT': '1',
           'Connection': 'keep-alive'}
url = 'https://www.ctic.org/crm?tdsourcetag=s_pctim_aiomsg'
headers2 = headers.copy()
headers2.update({'Referer': url,
                 'Origin': 'https://www.ctic.org'})  # update() works in place, so no reassignment


def new():
    # open a fresh session and grab a new CSRF token from the search page
    session = requests.Session()
    response = session.get(url=url, headers=headers)
    html = etree.HTML(response.text)
    _csrf = html.xpath('/html/head/meta[3]/@content')[0]
    return session, _csrf


session, _csrf = new()

for _, row in remain.iterrows():
    temp = dict(row)
    data.update(temp)
    data.update({'_csrf': _csrf})
    while True:
        try:
            response = session.post(url, data=data, headers=headers2, timeout=15)
            break
        except Exception as e:
            # the connection drops at random intervals: wait, reopen a session, retry
            session.close()
            print(e)
            print('\nSleep 30s.\n')
            time.sleep(30)
            session, _csrf = new()
            data.update({'_csrf': _csrf})
    df = pd.read_html(response.text)[0].dropna(how='all')
    df.columns = columns
    df['Year_'] = int(temp['CRMSearchForm[year]'])
    df['Units'] = 'Acres'
    df['Area'] = 'County'
    df['Region'] = temp['CRMSearchForm[region]']
    df['State'] = region_state[temp['CRMSearchForm[region]']][temp['CRMSearchForm[state]']]
    df['County'] = temp['CRMSearchForm[county]']
    df = df.reindex(columns=fields)
    for record in df.itertuples(index=False):
        tuple_record = tuple(record)
        sql_insert = f'INSERT INTO ctic_crm VALUES {tuple_record}'
        sql_insert = sql_insert.replace(', nan,', ', null,')
        crsr.execute(sql_insert)
        crsr.commit()
    print(total, row.to_list())
    total -= 1
else:
    print('Done!')
    crsr.close()
    cnxn.close()

Note the try...except in the middle of the loop: a Connection aborted error shows up at unpredictable intervals, sometimes only once every 9,000 or so requests, sometimes on the very first one. That is also why I added the "read the already-scraped records" and "exclude the already-scraped records" steps. And because I was worried about being flagged as a crawler, I padded the headers a bit (which did not seem to help at all); on every disconnect the script sleeps 30 seconds and opens a brand-new session.
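If the append + drop_duplicates(keep=False) step above looks odd, it is simply a cheap way to compute "all parameter combinations minus the ones already in the database". Here is a minimal sketch with toy data (hypothetical column names, and pd.concat instead of the now-deprecated DataFrame.append):

import pandas as pd

# toy stand-ins for the full parameter list and the rows already scraped
full = pd.DataFrame({'year': [2017, 2017, 2018], 'county': ['A', 'B', 'A']})
done = pd.DataFrame({'year': [2017], 'county': ['A']})

# stacking the scraped rows onto the full list and dropping every duplicated
# row leaves only the combinations not yet scraped; this only works if each
# row of `done` also appears, with identical dtypes, in `full`
remain = pd.concat([full, done]).drop_duplicates(keep=False)
print(remain)  # -> (2017, 'B') and (2018, 'A')

The dtype caveat is exactly why the script converts the year column of done to int64 before combining, and back to str afterwards.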

Then I left the program running over the weekend, and the command line finally printed Done! A look in Access showed 816,288 records. My thought: next time, try multithreading (or multiprocessing) and a proxy pool.
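As a rough idea of what that multithreaded version could look like, here is a minimal sketch (not the code I actually ran) that gives each worker thread its own session via the new() helper above and wraps the POST-and-parse step in a hypothetical fetch_one(); it reuses data, url, headers2 and remain from the script:

import threading
from concurrent.futures import ThreadPoolExecutor

thread_local = threading.local()  # one Session and CSRF token per worker thread


def get_session():
    # lazily open a session for the current thread using new() defined above
    if not hasattr(thread_local, 'session'):
        thread_local.session, thread_local.csrf = new()
    return thread_local.session, thread_local.csrf


def fetch_one(params):
    # hypothetical wrapper: POST one parameter combination and return the parsed table
    session, csrf = get_session()
    payload = {**data, **params, '_csrf': csrf}
    response = session.post(url, data=payload, headers=headers2, timeout=15)
    return pd.read_html(response.text)[0].dropna(how='all')


rows = [dict(row) for _, row in remain.iterrows()]
with ThreadPoolExecutor(max_workers=8) as pool:
    for df in pool.map(fetch_one, rows):
        # keep the Access writes in this single thread: pyodbc connections
        # are not safely shared across threads
        pass

A proxy pool would slot into fetch_one() through the proxies= argument of session.post(); whether either change actually speeds things up depends on how aggressively the site throttles.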

