This repository has been archived by the owner on Jun 10, 2024. It is now read-only.

Releases: binux/pyspider

v0.3.10

18 Apr 04:25

New features:

Fix several bugs:

  • Improve the performance of counter.to_dict
  • Fixed issue of counter changed during read
  • Fix tornado version dependency in setup.py

v0.3.9

18 Mar 21:00

New features:

  • Support for Python 3.6.
  • Auto Pause: a project is paused for scheduler.PAUSE_TIME (default: 5 min) when its last scheduler.FAIL_PAUSE_NUM (default: 10) tasks have all failed. After scheduler.PAUSE_TIME, scheduler.UNPAUSE_CHECK_NUM (default: 3) tasks are dispatched, and the project resumes if any one of them succeeds.
  • Each callback now has a default 30s processing time limit (platform support required). @beader
  • New JavaScript render engine: Splash support, enabled by the fetcher argument --splash-endpoint=http://splash:8050/execute
  • Python 3 webdav support.
  • Python 3 support for from projects import project.
  • A link to the corresponding task is added to the webui debug page when debugging an existing task in webui.
  • New user_agent parameter in self.crawl; you can still set the User-Agent through headers.
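
The Auto Pause rule can be sketched as a small state machine. This is only a toy model for illustration, not pyspider's scheduler code: the class AutoPause and its methods may_dispatch and record are hypothetical names, and the choice to start a fresh pause once all check tasks have failed is an assumption that the release note does not spell out.

```python
from collections import deque

# Defaults quoted in the release note.
PAUSE_TIME = 5 * 60        # seconds
FAIL_PAUSE_NUM = 10
UNPAUSE_CHECK_NUM = 3


class AutoPause:
    """Toy model of the auto-pause rule (hypothetical, not pyspider code)."""

    def __init__(self):
        # Outcomes of the most recent tasks; True means success.
        self.results = deque(maxlen=FAIL_PAUSE_NUM)
        self.paused_at = None   # time the current pause started, or None
        self.probes_sent = 0    # check tasks dispatched after the pause

    def may_dispatch(self, now):
        """Return True if a task may be dispatched at time `now`."""
        if self.paused_at is None:
            return True
        if now - self.paused_at < PAUSE_TIME:
            return False
        # The pause has elapsed: let a few check tasks through.
        if self.probes_sent < UNPAUSE_CHECK_NUM:
            self.probes_sent += 1
            return True
        return False

    def record(self, success, now):
        """Record the outcome of a finished task."""
        if self.paused_at is not None:
            if success:
                # Any successful check task resumes the project.
                self.paused_at = None
                self.probes_sent = 0
                self.results.clear()
            elif self.probes_sent >= UNPAUSE_CHECK_NUM:
                # Assumption: once every check task is out and failing,
                # a new pause period starts.
                self.paused_at = now
                self.probes_sent = 0
            return
        self.results.append(success)
        if len(self.results) == FAIL_PAUSE_NUM and not any(self.results):
            self.paused_at = now
```

With this model, ten consecutive failures pause the project, nothing is dispatched for the next five minutes, and then at most three check tasks are let through until one of them succeeds.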

Fix several bugs:

  • New webui dashboard frontend framework: vue.js, which improves performance with a large number of tasks (e.g. http://demo.pyspider.org/)
  • Fix crawl_config not working in webui while debugging a script.
  • Fix CSS Selector Helper not working. @ackalker
  • Fix connection_timeout not working.
  • Fix need_auth option not being applied to webdav.
  • Fix "fix can't dump counter to file: scheduler.all" error.
  • Some other fixes

v0.3.8

18 Aug 20:06

New features:

Fix several bugs:

  • Fix a global config object thread interference issue, which may cause a "connect to scheduler rpc error: error(10061, '')" error when running all with --run-in=thread (the default on Windows)
  • Fix response.save being lost when the fetch failed
  • Fix potential scheduler failure caused by an old version of six
  • Fix result dump returning nothing when using the mongodb backend

v0.3.7

20 Apr 20:42

retry_delay is a dict that specifies retry intervals. Its items are
{retried: seconds}, and the special key '' (empty string) sets the
default delay for any retry count that is not listed.
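
As a minimal sketch of how such a dict can be read: the values below are illustrative rather than pyspider's defaults, and delay_for is a hypothetical helper that only mirrors the lookup rule described above (an exact match on the retry count, otherwise the '' entry).

```python
# Illustrative values only: retry count -> delay in seconds.
retry_delay = {
    0: 30,             # first retry after 30 seconds
    1: 1 * 60 * 60,    # second retry after 1 hour
    2: 6 * 60 * 60,    # third retry after 6 hours
    '': 24 * 60 * 60,  # any other retry count: 24 hours
}


def delay_for(retried, table):
    """Return the delay for the given retry count.

    Hypothetical helper: falls back to the '' key when the count
    has no entry of its own.
    """
    if retried in table:
        return table[retried]
    return table['']
```

Here delay_for(0, retry_delay) gives 30 seconds, while a count with no entry, such as 7, falls back to the 24-hour default.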

  • dict parameters in crawl_config and @config are now merged (e.g. headers), thanks to @ihipop
  • add parameter max_redirects in self.crawl to control the maximum number of redirects during a fetch, thanks to @AtaLuZiK
  • add parameter validate_cert in self.crawl to ignore server certificate errors.
  • new property etree for Response, etree is a cached lxml.html.HtmlElement object, thanks to @waveyeung
  • you can now pass arguments to phantomjs from command line or config file.
  • support for pymongo 3.0
  • local.projectdb now accepts a glob path (e.g. script/*.py) to load multiple projects from local filesystem.
  • fix queue size in the dashboard not working on OS X, thanks to @xyb
  • counters in dashboard are now shown for stopped projects
  • other bug fixes
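
The merging of dict parameters mentioned in the first item can be pictured with the sketch below. It is not pyspider's implementation: merge_options is a hypothetical helper, and the precedence (later, more specific layers override earlier ones) is an assumption made for illustration.

```python
def merge_options(*layers):
    """Merge option layers from left to right (hypothetical helper).

    Plain values from a later layer replace earlier ones; dict values
    such as headers are merged key by key instead of being replaced.
    """
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Combine both dicts; the later layer wins on conflicts.
                merged[key] = {**merged[key], **value}
            else:
                merged[key] = value
    return merged


# Illustrative inputs standing in for crawl_config and @config arguments.
crawl_config = {'headers': {'User-Agent': 'my-bot'}, 'timeout': 30}
config_args = {'headers': {'Accept': 'text/html'}, 'timeout': 10}
options = merge_options(crawl_config, config_args)
```

In this example both headers survive in options['headers'], while timeout simply takes the value from the later layer.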

v0.3.6

10 Nov 00:33
  • NEW: webdav mode, now you can use webdav to mount the project folder to your local filesystem and edit scripts with your favorite editor! (Python 3 is not supported; wsgidav is required and is not included in setup.py)
  • bug fixes for Python 3 compatibility, Postgresql, flask-Login>=0.3.0, typo and more, thanks for the help of @lushl9301 @hitjackma @exoticknight @d0ugal @qiang.luo @twinmegami @jttoday @machinewu @littlezz @yaokaige
  • fix Queue.qsize NotImplementedError on Mac OS X, thanks @xyb

v0.3.5

22 May 16:02
  • New parameter: auto_recrawl - automatically restart the task every age.
  • New parameter: js_viewport_width/js_viewport_height to set viewport size for phantomjs engine.
  • New command line option to set different message queue backends with URI scheme.
  • New task level storage mechanism: self.save
  • New redis taskdb
  • New redis message queue.
  • New high level message queue interface kombu.
  • Fix bugs related to mongodb (keyword missing if not set).
  • Fix phantomjs not working in all mode.
  • Fix a potential deadlock in processor send_message.
  • Default log level of scheduler is changed to INFO

v0.3.4

21 Apr 15:01

Global

  • New message queue support: beanstalkd by @tiancheng91
  • New global argument: --logging-config to specify a custom logging config (to disable werkzeug logs, for instance). You can get a sample config from pyspider/logging.conf.
  • Project group info is added to task package now.
  • Change docker base image to cmfatih/phantomjs; you can now use phantomjs with the same docker image.
  • Auto restart phantomjs if it crashes; only enabled in all mode by default.

WebUI

  • Show next exetime of a task in task page.
  • Show fetch time and process time in tasks page.
  • Show average fetch time and process time in 5min in dashboard page.
  • Show message queue status in dashboard page.
  • limit and offset parameter support in result dump.
  • Fix frontend bug when crawling pages with dataurl.

Other

  • Fix support for phantomjs 2.0.
  • Fix scheduler project update notification not working, and use the md5sum of the script as an additional signal.
  • Scheduler: periodic counter report in log.
  • Fetcher: fix for legacy version of pycurl

v0.3.3

08 Mar 13:32

API

WEBUI

  • A new CSS selector toolbar is added; the pre-generated CSS selector pattern can be modified and added or copied to the script.

Benchmarking

  • The database table for the bench test is cleared before and after the bench test.
  • insert/update/get bench tests for the database and put/get tests for the message queue are added.

Other

  • The default message queue is switched to amqp.
  • docs fix.

v0.3.2

11 Feb 16:33

Scheduler

  • The size of task queue is more accurate now, you can use it to determine all done status of scheduler.

Fetcher

  • Fix tornado losing cookies during 30x redirects
  • You can now use cookies together with a cookie header
  • Fix proxy not working bug.
  • Enable proxy by default.
  • Proxy now supports username and password authorization. @soloradish
  • Etag and Last-Modified headers will be disabled when the last crawl failed.

Databases

  • MySQL default engine changed to InnoDB @laapsaap
  • MySQL, larger result column size, changed to MEDIUMBLOB(up to 16M) @laapsaap

WebUI

  • WebUI now uses the same arguments as the fetcher; this fixes the proxy not working for webui.
  • Results will be sorted in the order of updatetime.

One Mode

  • Script exception logs are now printed to the screen

New Command send_message

You can use the command pyspider send_message [project] [message] to send a message to project via command-line.

Other

  • Use locally hosted test web pages
  • Remove the version pin on lxml; you can use apt-get to install any version of lxml

v0.3.1

22 Jan 15:59

One Mode

One mode does not only mean all-in-one: it runs everything in one process over tornado.ioloop. One mode is designed for debugging. You can test scripts written in local files and use --interactive to choose a task to test.

With one mode you can use pyspider.libs.utils.python_console() to open an interactive shell in your script context to test your code.

full documentation: http://docs.pyspider.org/en/latest/Command-Line/#one

  • bug fix