ibrahimgunduz34/crawleme

What is CrawleMe!?

CrawleMe! is an easy way to crawl image or link URLs from any website.

How It Works

Create your web page wrapper class.

from crawleme.base import BasePage

class MyPage(BasePage):
	url = 'http://www.mysite.com'
	item_path = '//*[@id="campaign_list"]/div/a'
	item_attribute = 'href'

Create an instance of the wrapper class and call the crawle method.

crawler = MyPage()
urls = crawler.crawle()

for url in urls:
	print(url)

Result:

http://www.mysite.com/id/5
http://www.mysite.com/aboutus/
http://www.mysite.com/foo/
http://www.mysite.com/bar/
http://www.mysite.com/baz/
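
To make the roles of item_path and item_attribute more concrete, here is a minimal sketch of the same kind of selection using only the Python standard library. It is not CrawleMe!'s implementation: the sample markup and the select_attribute helper are hypothetical, and xml.etree.ElementTree only handles well-formed markup and a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for a real page.
SAMPLE_MARKUP = """
<html>
  <body>
    <div id="campaign_list">
      <div><a href="http://www.mysite.com/id/5">Campaign</a></div>
      <div><a href="http://www.mysite.com/aboutus/">About us</a></div>
    </div>
  </body>
</html>
""".strip()


def select_attribute(markup, item_path, item_attribute):
    """Return item_attribute for every element matched by item_path.

    Hypothetical helper for illustration; not part of CrawleMe!.
    """
    root = ET.fromstring(markup)
    # ElementTree expects a relative path, so '//...' becomes './/...'.
    elements = root.findall("." + item_path)
    values = [element.get(item_attribute) for element in elements]
    # Skip matched elements that do not carry the attribute.
    return [value for value in values if value is not None]


urls = select_attribute(SAMPLE_MARKUP, '//*[@id="campaign_list"]/div/a', 'href')
for url in urls:
    print(url)
```

The path picks the a elements nested under the element whose id is campaign_list, and the attribute name decides which value is read from each match.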

You can also pass or override the url or item_path of the wrapper class when creating an instance.

crawler = MyPage(url='http://www.mysite.com/id/112312')
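
One way such per-instance overrides could work is sketched below. This is an assumption about the general pattern, not the actual BasePage code: SketchBasePage and CampaignPage are hypothetical names, and the real class may validate or store options differently.

```python
class SketchBasePage(object):
    """Hypothetical stand-in for a base page class; not crawleme.base.BasePage."""

    url = None
    item_path = None
    item_attribute = None

    def __init__(self, **overrides):
        for name, value in overrides.items():
            # Only accept names that are already declared on the class.
            if not hasattr(type(self), name):
                raise TypeError("unknown option: %s" % name)
            # Set on the instance, so the class-level default stays untouched.
            setattr(self, name, value)


class CampaignPage(SketchBasePage):
    url = 'http://www.mysite.com'
    item_path = '//*[@id="campaign_list"]/div/a'
    item_attribute = 'href'


default_page = CampaignPage()
detail_page = CampaignPage(url='http://www.mysite.com/id/112312')

print(default_page.url)
print(detail_page.url)
```

Because the override is stored on the instance, other instances of the same wrapper class keep the defaults declared on the class.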

Properties:

url:
URL of the page to crawl.

item_path:
XPath of the DOM element(s) to select.

item_attribute:
Attribute to read from the selected DOM element(s).

has_only_single_item (default=False):
When True, the crawle method returns a single value instead of a list of values.

fix_urls (default=True):
Sometimes a DOM element attribute contains only a path, without the protocol and hostname. When this property is True, parsed values are converted to full URLs.
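
The effect of fix_urls and has_only_single_item can be illustrated with a short sketch built on the standard library's urljoin. The postprocess function is hypothetical and is not how CrawleMe! is known to implement these properties; returning None for an empty result in single-item mode is likewise an assumption.

```python
from urllib.parse import urljoin


def postprocess(values, page_url, fix_urls=True, has_only_single_item=False):
    """Hypothetical helper mirroring the two properties described above."""
    if fix_urls:
        # Resolve path-only values against the page URL; full URLs pass through.
        values = [urljoin(page_url, value) for value in values]
    if has_only_single_item:
        # Assumption: return the first match, or None when nothing matched.
        return values[0] if values else None
    return values


raw_values = ['/foo/', 'http://www.mysite.com/bar/']
print(postprocess(raw_values, 'http://www.mysite.com'))
print(postprocess(raw_values, 'http://www.mysite.com', has_only_single_item=True))
```

With fix_urls disabled, the values are returned exactly as they appear in the page, which can be useful when the raw attribute content matters.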

Methods:

crawle([timeout=crawleme.conf.REQUEST_TIMEOUT],[renew=False]):
Parses a list of values, or a single value, from the page according to the specified properties.

get_filename([timeout=crawleme.conf.REQUEST_TIMEOUT]):
Returns the requested filename.

read([timeout=crawleme.conf.REQUEST_TIMEOUT]):
Reads data from the stream.
