I have a spider that scrapes data which cannot be saved in a single item class.
For illustration, I have one ProfileItem, and each ProfileItem might have an unknown number of comments. That is why I want to implement two item classes, ProfileItem and CommentItem. I know I can pass both to my pipeline simply by yielding them.
However, I do not know how a pipeline with a single process_item method can handle two different item classes.
Or is it possible to use different process_item methods?
Or do I have to use several pipelines?
Or is it possible to store an iterable in a Scrapy Item field, like this:
comments_list = []
comments = response.xpath(somexpath)
for x in comments.extract():
    comments_list.append(x)
ScrapyItem['comments'] = comments_list
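For reference, this is roughly how I would declare the two item classes (the field names are just placeholders):

import scrapy

class ProfileItem(scrapy.Item):
    name = scrapy.Field()
    comments = scrapy.Field()  # a Field can hold any value, including a list

class CommentItem(scrapy.Item):
    profile_url = scrapy.Field()
    text = scrapy.Field()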
By default every item goes through every pipeline. For instance, if you yield a ProfileItem and a CommentItem, they'll both go through all pipelines. If you have a pipeline set up to track item types, then your process_item method could look like this:
def process_item(self, item, spider):
    self.stats.inc_value('typecount/%s' % type(item).__name__)
    return item
When a ProfileItem comes through, 'typecount/ProfileItem' is incremented. When a CommentItem comes through, 'typecount/CommentItem' is incremented.
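Note that self.stats in the snippet above has to come from somewhere; a common way to wire it up is through the pipeline's from_crawler classmethod (a minimal sketch, assuming the default Scrapy stats collector; the class name is made up):

class ItemTypeStatsPipeline(object):
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.stats is the crawler's stats collector instance
        return cls(crawler.stats)

    def process_item(self, item, spider):
        # Count items by their class name, e.g. 'typecount/ProfileItem'
        self.stats.inc_value('typecount/%s' % type(item).__name__)
        return item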
You can have one pipeline handle only one type of item, though, if processing that item type is unique, by checking the item type before proceeding:
def process_item(self, item, spider):
    if not isinstance(item, ProfileItem):
        return item
    # Handle your ProfileItem here.
If you had the two process_item methods above set up in different pipelines, an item would go through both of them: tracked by the first, and processed (or ignored) by the second.
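Both pipelines would also need to be enabled in settings.py for any of this to run; the module path and the order values below are assumptions (lower numbers run first):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ItemTypeStatsPipeline': 100,
    'myproject.pipelines.ProfilePipeline': 200,
}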
Additionally, you could have one pipeline set up to handle all 'related' items:
def process_item(self, item, spider):
    if isinstance(item, ProfileItem):
        return self.handleProfile(item, spider)
    if isinstance(item, CommentItem):
        return self.handleComment(item, spider)

def handleComment(self, item, spider):
    # Handle the comment here, then return the item
    return item

def handleProfile(self, item, spider):
    # Handle the profile here, then return the item
    return item
Or, you could make it even more complex and develop a type delegation system that loads classes and calls default handler methods, similar to how Scrapy handles middleware/pipelines. It's really up to you how complex you need it, and what you want to do.
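For instance, a minimal sketch of that kind of delegation, dispatching on the item's class name (all names here are made up; unknown item types simply pass through):

class DelegatingPipeline(object):
    def process_item(self, item, spider):
        # Look up a handler method named after the item class,
        # e.g. CommentItem -> handle_commentitem
        handler = getattr(self, 'handle_%s' % type(item).__name__.lower(), None)
        if handler is not None:
            return handler(item, spider)
        return item

    def handle_profileitem(self, item, spider):
        # Default handler for ProfileItem
        return item

    def handle_commentitem(self, item, spider):
        # Default handler for CommentItem
        return item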