Each Answer to this Q is separated by one/two green lines.
I would like to get the same result as this command line :
scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json
My script is as follows :
import scrapy from linkedin_anonymous_spider import LinkedInAnonymousSpider from scrapy.crawler import CrawlerProcess from scrapy.utils.project import get_project_settings spider = LinkedInAnonymousSpider(None, "James", "Bond") process = CrawlerProcess(get_project_settings()) process.crawl(spider) ## <-------------- (1) process.start()
I found out that process.crawl() in (1) is creating another LinkedInAnonymousSpider where first and last are None (printed in (2)), if so, then there is no point of creating the object spider and how is it possible to pass the arguments first and last to process.crawl()?
from logging import INFO import scrapy class LinkedInAnonymousSpider(scrapy.Spider): name = "linkedin_anonymous" allowed_domains = ["linkedin.com"] start_urls =  base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search" def __init__(self, input = None, first= None, last=None): self.input = input # source file name self.first = first self.last = last def start_requests(self): print self.first ## <------------- (2) if self.first and self.last: # taking input from command line parameters url = self.base_url % (self.first, self.last) yield self.make_requests_from_url(url) def parse(self, response): . . .
pass the spider arguments on the
process.crawl(spider, input="inputargument", first="James", last="Bond")
You can do it the easy way:
from scrapy import cmdline cmdline.execute("scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json".split())
if you have Scrapyd and you want to schedule the spider, do this
curl http://localhost:6800/schedule.json -d project=projectname -d spider=spidername -d first="James" -d last="Bond"