Scenario
In some cases it is more convenient and efficient to use Splash on its own, outside the Scrapy framework.
Configuration
Proxy: a tunnel proxy works best.
On the host machine, create the file /root/splash/proxy-files/cip.ini. Note: unlike the official documentation, the .ini extension should be lowercase.
```ini
[proxy]
; fill in your own values
host=your-proxy-host
port=your-proxy-port
username=your-username
password=your-password
type=HTTP

[rules]
whitelist=
  .*cip.cc.*
blacklist=
  .*.js.*
  .*.css.*
  .*.png
```
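If you manage several proxies, the same file can be generated from code. A minimal sketch using Python's `configparser` (host, port, and credentials below are placeholders, and the helper name is mine; the sections and keys mirror the cip.ini above):

```python
from configparser import ConfigParser

def write_profile(path, host, port, username, password):
    # Build a Splash proxy profile: a [proxy] section plus an optional [rules] section,
    # mirroring the hand-written cip.ini above.
    cfg = ConfigParser()
    cfg["proxy"] = {
        "host": host,
        "port": str(port),
        "username": username,
        "password": password,
        "type": "HTTP",
    }
    cfg["rules"] = {"whitelist": ".*cip.cc.*"}
    with open(path, "w") as f:
        cfg.write(f)

write_profile("cip.ini", "proxy.example.com", 8080, "user", "secret")
```

Drop the generated file into /root/splash/proxy-files/ and select it with `&proxy=cip`; the profile name is the filename without the extension.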
Start Splash with Docker
```shell
[root@host proxy-files]# docker run -p 8050:8050 -v /root/splash/proxy-files:/etc/splash/proxy-profiles scrapinghub/splash

# start in the background
docker run -d -p 8050:8050 --restart=always -v /root/splash/proxy-files:/etc/splash/proxy-profiles scrapinghub/splash
```
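For a long-running deployment, the same options can be captured in a Compose file instead of remembering the `docker run` flags (a sketch; the image, port, and volume path are taken from the commands above):

```yaml
# docker-compose.yml -- equivalent of the backgrounded docker run above
version: "3"
services:
  splash:
    image: scrapinghub/splash
    restart: always
    ports:
      - "8050:8050"
    volumes:
      - /root/splash/proxy-files:/etc/splash/proxy-profiles
```

With this in place, `docker compose up -d` replaces the `docker run -d ...` invocation.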
Startup log:
```
2022-10-13 04:22:24+0000 [-] Log opened.
2022-10-13 04:22:24.608906 [-] Xvfb is started: ['Xvfb', ':322410884', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2022-10-13 04:22:24.718768 [-] Splash version: 3.5
2022-10-13 04:22:24.774528 [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
2022-10-13 04:22:24.774753 [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
2022-10-13 04:22:24.774878 [-] Open files limit: 1048576
2022-10-13 04:22:24.774946 [-] Can't bump open files limit
2022-10-13 04:22:24.806401 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2022-10-13 04:22:24.806578 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2022-10-13 04:22:24.969678 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2022-10-13 04:22:24.970001 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2022-10-13 04:22:24.970452 [-] Site starting on 8050
2022-10-13 04:22:24.970553 [-] Starting factory <twisted.web.server.Site object at 0x7f729c0f4550>
2022-10-13 04:22:24.970848 [-] Server listening on http://0.0.0.0:8050
```
The line `2022-10-13 04:22:24.806401 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles` confirms that the proxy profile was loaded.
Usage
Fetch cip.cc to check the current IP:
```shell
curl 'http://10.0.19.90:8050/render.html?url=http://cip.cc&proxy=cip'
```
Response:
```html
<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">Too Many Request
</pre></body></html>

➜ ~ curl 'http://10.0.19.90:8050/render.html?url=http://cip.cc&proxy=cip'

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>IP查询 - 查IP(www.cip.cc)</title>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="description" content="查IP(www.cip.cc)网站, 提供免费的IP查询服务,命令行查询IP, 并且支持'PC网站, 手机网站, 命令行(Windows/UNIX/Linux)' 三大平台, 是个多平台的IP查询网站, 更新即使, 数据准确是我们的目标">
  <meta name="keywords" content="IP, 查IP, IP查询">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta content="width=device-width,initial-scale=1" name="viewport">
  <link rel="icon" href="data:;base64,=">
  <link href="//static.cip.cc/static/styles.min.css?v=15" rel="stylesheet">
  <script src="https://hm.baidu.com/hm.js?6c34da399cbcfbb71d86c72215942759"></script><script type="text/javascript" src="//static.cip.cc/static/js.min.js?v=6"></script>
</head>
<body>
  <div class="wrapper">
    <div class="page">
      <div class="logo">
        <h1>
          <strong>多平台的命令行IP查询</strong>
          <a href="//www.cip.cc/" title="手机, 命令行IP查询"><img src="//static.cip.cc/static/img/logo.png?v=2" alt="手机, 命令行IP查询"></a>
        </h1>
      </div>
      <div class="search">
        <form action="/" onsubmit="return query();">
          <table><tbody><tr>
            <td style=" width: 75%; "><input id="data-input" placeholder="请输入要查询的 IP 地址" size="26" type="text"></td>
            <td><input id="data-submit" type="submit" class="kq-button" value="查询"></td>
          </tr></tbody></table>
        </form>
      </div>
      <div class="data kq-well">
        <pre>IP     : 182.87.15.14
地址   : 中国 江西 鹰潭
运营商 : 电信

数据二 : 江西省鹰潭市 | 电信
数据三 : 中国江西省鹰潭市 | 电信

URL    : http://www.cip.cc/182.87.15.14 </pre>
```
As you can see, the IP has changed.
Python requests demo
```python
import requests

splash_host = "10.0.19.90"  # your Splash server's address
target_url = "http://cip.cc"
url = f'http://{splash_host}:8050/render.html?url={target_url}&proxy=cip'
response = requests.get(url)
print(response.text)
```
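One caveat: `render.html`'s `url` argument is passed unencoded above, which works only because `http://cip.cc` has no query string of its own. If the target URL contains `&`, encode it first; a small sketch with the standard library (the host and target below are placeholders):

```python
from urllib.parse import urlencode

# When the target URL has its own query string, its '&' must be percent-encoded,
# otherwise Splash treats everything after it as arguments to render.html itself.
target_url = "http://cip.cc/?foo=1&bar=2"  # hypothetical target with a query string
qs = urlencode({"url": target_url, "proxy": "cip", "wait": 2})
splash_url = "http://your-splash-host:8050/render.html?" + qs
print(splash_url)
```

`urlencode` turns the inner `&` into `%26`, so only Splash's own `url`, `proxy`, and `wait` arguments remain at the top level.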
A real-world example:
```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor
from retrying import retry

@retry(stop_max_attempt_number=3)
def req_taobao():
    target_url = "https://shop551707528.taobao.com/search.htm?search=y&orderType=hotsell_desc&&pageNo=1"
    url = f'http://10.0.19.90:8050/render.html?url={target_url}&proxy=taobao&wait=5'
    # the client timeout must exceed Splash's wait=5, or every request times out
    response = requests.get(url, timeout=30)
    print(response.text)

# start crawling
s_time = time.time()
executor = ThreadPoolExecutor(5)
for i in range(10):
    executor.submit(req_taobao)
executor.shutdown()
print(time.time() - s_time)
```
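One pitfall in the pattern above: exceptions raised inside a submitted worker are silently swallowed unless you call `future.result()`. A variant that collects results and surfaces errors, with `fetch` as a hypothetical stand-in for `req_taobao`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(i):
    # stand-in for req_taobao(); returning a value (instead of printing) lets the
    # caller collect results, and future.result() re-raises any worker exception
    return f"page-{i}"

with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(fetch, i) for i in range(10)]
    results = [f.result() for f in as_completed(futures)]

print(len(results))
```

The `with` block also replaces the explicit `executor.shutdown()` call.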
References