In this post I’ll write more about the files created when you start a new Scrapy project, and about how to write a simple spider and a crawl spider.
A simple spider (or Base Spider) scrapes the data you need from a single webpage, whereas a Crawl Spider recursively crawls through all the linked webpages.
Here is a quick look at the files and the subdirectory that get created:
items.py : This file defines the fields that you want to scrape and store from a website.
settings.py : Some settings, like allowing redirection or handling 404 responses while scraping.
spiders : This folder is where you store your spiders. You can name your spider anything.
pipelines.py : You filter your data by adding code to this file. For example, you can drop duplicates by adding a few lines of code here.
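To make the duplicate-dropping idea concrete, here is a minimal sketch of what a pipeline in pipelines.py could look like. The class name DuplicatesPipeline and the choice of 'Heading' as the key are my own assumptions for illustration; the try/except around the import is only there so the snippet also runs on a machine without Scrapy.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in (assumption) so the sketch runs without Scrapy installed.
    class DropItem(Exception):
        pass


class DuplicatesPipeline(object):
    """Hypothetical pipeline: drops items whose 'Heading' was already seen."""

    def __init__(self):
        self.seen_headings = set()

    def process_item(self, item, spider):
        heading = item['Heading']
        if heading in self.seen_headings:
            # Raising DropItem tells Scrapy to discard this item.
            raise DropItem("Duplicate item found: %s" % heading)
        self.seen_headings.add(heading)
        return item
```

A pipeline only runs once it is listed under ITEM_PIPELINES in settings.py; the exact form of that setting depends on your Scrapy version, so check the docs for the one you have installed.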
What is XPath?
In simple terms it’s the path to the data which you need to extract. Read and learn more here. You can also get the XPath of some content in a webpage by following these steps:
1. Go to the developer tools in your browser.
2. Move to the Elements tab, and right click on the data and copy the xpath.
3. You can use it in your spider with some minor changes. Try it out in the Console tab before using it, or when you are debugging.
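If you want to get a feel for XPath without running a spider, here is a small sketch that uses only Python's standard library. Note that xml.etree.ElementTree understands just a subset of XPath (no text(), for instance), so the text is read through the .text attribute instead; in a Scrapy spider you would pass the full expression to the selector. The HTML snippet is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page with the same shape as the XPaths used in this post.
PAGE = """
<html><body>
  <div id="content">
    <h3>First heading</h3>
    <p> Some text. </p>
    <h3>Second heading</h3>
  </div>
  <div id="footer"><h3>Not wanted</h3></div>
</body></html>
"""

root = ET.fromstring(PAGE)

# "Any div whose id is 'content', then its h3 children"
headings = [h.text for h in root.findall(".//div[@id='content']/h3")]

# Same idea for paragraphs, stripping the surrounding whitespace
paragraphs = [p.text.strip() for p in root.findall(".//div[@id='content']/p")]

print(headings)
print(paragraphs)
```

The heading inside the footer div is skipped because the path only matches the div with id "content".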
While this is enough as a guide or tutorial, I’ll be adding what I have learnt after writing a good number of scrapers/parsers.
Let’s write a simple crawler.
items.py : Let’s edit this file to store two fields, the heading and its content. The file will look like this.
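A minimal sketch of items.py, reconstructed from the names the spider uses (ScrapscrapyItem, 'Heading' and 'Content'). In the real file you only need the plain import; the try/except fallback is an assumption of mine so that the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.item import Item, Field
except ImportError:
    # Scrapy not installed: plain-dict stand-ins (assumption) so the sketch still runs.
    Item, Field = dict, dict


class ScrapscrapyItem(Item):
    # The two fields that the spider fills in
    Heading = Field()
    Content = Field()
```

An item behaves much like a dictionary, so the spider can write item['Heading'] = ... and item['Content'] = ... once it has created a ScrapscrapyItem.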
settings.py and pipelines.py : As we are writing our first scraper, let’s keep it very simple and leave these files unchanged. After all, we are only scraping a single page.
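Nothing in settings.py is required for this first spider. Should you later want to control redirects or let 404 pages reach your spider, a fragment along these lines might do it. I am assuming the setting names REDIRECT_ENABLED and HTTPERROR_ALLOWED_CODES here, so verify them against the docs of your Scrapy version.

```python
# settings.py (fragment, not needed for the first spider)

# Follow HTTP redirects. Scrapy already does this by default.
REDIRECT_ENABLED = True

# Let responses with these status codes reach the spider
# instead of being filtered out before your callback runs.
HTTPERROR_ALLOWED_CODES = [404]
```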
ScrapySpider.py : Now let’s create our first spider in the spiders folder.
from scrapy.selector import Selector
from scrapy.spider import BaseSpider
from ScrapScrapy.items import ScrapscrapyItem

# Base spider crawls over a single page and not the subsequent links present on the first page
class ScrapyscrapSpider(BaseSpider):
    name = "ss"
    allowed_domains = ["scrapy.org"]
    start_urls = ['http://scrapy.org/']

    # This will be called automatically
    def parse(self, response):
        sel = Selector(response)
        # Making an object of our class present in items.py
        item = ScrapscrapyItem()
        # Our two fields present in items.py, here we are assigning values to them
        item['Heading'] = str(sel.xpath('//*[@id="content"]/h3/text()').extract()) + str(sel.xpath('//*[@id="content"]/dl/h3/text()').extract())
        item['Content'] = str(map(unicode.strip, sel.xpath('//*[@id="content"]/p/text()').extract()))
        item['Content'] += str(map(unicode.strip, sel.xpath('//dl//text()').extract()))
        # Return what is fetched, you can also yield
        return item
So now we are done with the coding part. Let’s run our spider.
Move to the base folder (the main project folder) and run the command given below, replacing spider-name with the name set in your spider ("ss" in the code above).
scrapy crawl spider-name -o output_file_name.csv -t csv
This will store the fetched content in a CSV file. View what you have scraped. Sample file is here.
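If you would rather inspect the output from Python than open it in a spreadsheet, here is a small Python 3 sketch using the standard csv module. The helper name read_items and the sample row are made up for illustration; the column names match the two fields in items.py.

```python
import csv
import io


def read_items(csv_file):
    """Return the exported rows as a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))


# Stand-in for output_file_name.csv; the values are made up for illustration.
sample = io.StringIO("Content,Heading\r\nsome text,some heading\r\n")
rows = read_items(sample)

for row in rows:
    print(row["Heading"], "->", row["Content"])
```

With a real export you would open the file instead, for example with open("output_file_name.csv", newline="") as f, and pass f to read_items.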
Now you are done with your crawler which crawls a single page.
Let’s write a crawl spider which recursively crawls pages.
It’s also simple. You simply need to add some rules to recognize the links which you want your crawler to follow.
rules = (
    Rule(SgmlLinkExtractor(allow=(".*",), unique=True), callback='parse_item', follow=True),
)
Other than this, you need to change the function name from parse to parse_item (or any other name). CrawlSpider uses parse internally, so you need to define your own function, which is called whenever a rule matches.
One more change that you need to make in your previous spider code is to use CrawlSpider instead of BaseSpider.
Do not forget to change the XPaths and fields to match what you need to scrape.
You need to add links or regexes in the allow part. You set unique to True so that the crawler does not crawl the same links again and again.
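To see what allow and unique mean in practice, here is a rough standard-library sketch of the idea. This is not Scrapy's implementation; filter_links is a hypothetical helper of mine, and I am assuming the regex is matched anywhere in the URL.

```python
import re


def filter_links(links, allow=".*"):
    """Keep links matching the `allow` regex, dropping repeats (order preserved)."""
    pattern = re.compile(allow)
    seen = set()
    kept = []
    for link in links:
        # Skip links we already kept, and links the regex does not match
        if link in seen or not pattern.search(link):
            continue
        seen.add(link)
        kept.append(link)
    return kept


links = [
    "http://scrapy.org/doc/",
    "http://scrapy.org/download/",
    "http://scrapy.org/doc/",
    "http://scrapy.org/community/",
]

doc_links = filter_links(links, allow=r"/doc/")  # only the doc page, once
all_links = filter_links(links)                  # ".*" allows everything, minus the repeat
```

A pattern like ".*" lets every link through, so in a real crawl you will usually want something narrower, otherwise the spider follows everything inside the allowed domains.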
The complete code is here.
Now go and scrape a website of your choice. Let me know here if something irritates you in the process.
In the next post I’ll be covering some advanced stuff in Scrapy, like using Selenium with Scrapy and changing the templates which get loaded when you create a new Scrapy project, etc.
Stay tuned. 🙂