
Scraping With Scrapy : Part 1

This is going to be a three-part series.

Part 1 covers installing Scrapy and starting your first project.
Let’s get started.

What is Scrapy?

Scrapy is an application framework for crawling websites and extracting structured data. That data can be used for a wide range of applications, such as data mining, information processing, or historical archival. It’s all in Python.
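To make "extracting structured data" concrete, here is a minimal sketch that pulls links out of a page by hand, using only Python's standard library. This is not Scrapy code: the class name LinkExtractor, the function extract_links, and the sample page are all made up for illustration. It only shows the kind of work a framework like Scrapy takes off your hands.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href and text of every <a> tag as a dict (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []        # structured output: one dict per link
        self._current = None   # link being read, or None when outside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"url": dict(attrs).get("href"), "text": ""}

    def handle_data(self, data):
        # Only keep text that appears inside an <a> tag.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def extract_links(html):
    """Return a list of {"url": ..., "text": ...} dicts found in the HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content, used only to demonstrate the idea.
SAMPLE_PAGE = """
<html><body>
  <a href="/part-1">Scrapy: Part 1</a>
  <a href="/part-2">Scrapy: Part 2</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_PAGE):
        print(link)
```

A real crawl also needs to download pages, follow links, and handle errors, which is where a framework earns its keep.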

 

Installing Scrapy

1.   Install gcc, the development headers and lxml.

sudo apt-get install build-essential
sudo apt-get install python-dev
sudo apt-get install libevent-dev
sudo apt-get install libxml2 libxml2-dev
sudo apt-get install libxslt-dev
sudo apt-get install python-lxml

2.   Install twisted

sudo apt-get install python-twisted python-libxml2 python-simplejson

3.   Install pyOpenSSL

wget http://pypi.python.org/packages/source/p/pyOpenSSL/pyOpenSSL-0.13.tar.gz
tar -zxvf pyOpenSSL-0.13.tar.gz
cd pyOpenSSL-0.13
sudo python setup.py install

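# If the build fails with an error like "gcc exit status 1", the OpenSSL
# development headers are probably missing. On Debian/Ubuntu:
sudo apt-get update
sudo apt-get install libssl-dev

# then re-run the build:
sudo python setup.py install

# On yum-based systems (Fedora, RHEL, CentOS), use these instead of apt-get:
sudo yum install python-devel libxml2-devel libxslt-devel
sudo yum install pyOpenSSL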

4.  Install pycrypto

wget http://pypi.python.org/packages/source/p/pycrypto/pycrypto-2.5.tar.gz
tar -zxvf pycrypto-2.5.tar.gz
cd pycrypto-2.5
sudo python setup.py install

5.   Install easy_install (if you don’t already have it)

wget http://peak.telecommunity.com/dist/ez_setup.py
python ez_setup.py

6.   Install w3lib

sudo easy_install -U w3lib

7.   Install scrapy

sudo easy_install Scrapy
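Once the seven steps are done, it can help to confirm that every package is actually visible to Python. The following is a minimal sketch, assuming Python 3.4 or later (for importlib.util.find_spec); the names REQUIRED_MODULES and check_modules are made up for this example. Note that the import name sometimes differs from the package name.

```python
import importlib.util

# Package name -> import name for the packages installed in steps 1-7.
# pyOpenSSL is imported as "OpenSSL" and pycrypto as "Crypto".
REQUIRED_MODULES = {
    "lxml": "lxml",
    "Twisted": "twisted",
    "pyOpenSSL": "OpenSSL",
    "pycrypto": "Crypto",
    "w3lib": "w3lib",
    "Scrapy": "scrapy",
}


def check_modules(modules=REQUIRED_MODULES):
    """Return {package name: True/False}, True when the module can be located."""
    return {
        package: importlib.util.find_spec(import_name) is not None
        for package, import_name in modules.items()
    }


if __name__ == "__main__":
    for package, found in check_modules().items():
        print("{:<10} {}".format(package, "ok" if found else "MISSING"))
```

Anything reported as MISSING points back to the step that installs it. Running "scrapy version" in the shell is another quick way to confirm that the last step worked.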

 

Creating a project in Scrapy

  scrapy startproject my_first_project 

The directory structure will look like this:

my_first_project/
    scrapy.cfg
    my_first_project/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
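If you want to double-check that startproject produced the layout shown above, a short script can compare the folder against that list. This is only a sketch: EXPECTED_FILES simply mirrors the files named in the tree above, and missing_project_files is a hypothetical helper, not part of Scrapy.

```python
from pathlib import Path

# Files named in the directory tree above, relative to the outer project folder.
# "{name}" stands for the inner package, which shares the project's name.
EXPECTED_FILES = [
    "scrapy.cfg",
    "{name}/__init__.py",
    "{name}/items.py",
    "{name}/pipelines.py",
    "{name}/settings.py",
    "{name}/spiders/__init__.py",
]


def missing_project_files(project_dir, name=None):
    """Return the expected files that are absent from project_dir."""
    root = Path(project_dir)
    # Default to the folder's own name, e.g. "my_first_project".
    name = name or root.resolve().name
    expected = [entry.format(name=name) for entry in EXPECTED_FILES]
    return [entry for entry in expected if not (root / entry).is_file()]


if __name__ == "__main__":
    missing = missing_project_files("my_first_project")
    print(missing if missing else "layout looks complete")
```

Run it from the directory where you executed scrapy startproject. An empty result means every file in the tree is present.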

 

You have now installed Scrapy and started your first project.

In Part 2 of this series, I’ll cover how to write a simple spider and a crawl spider (which recursively crawls a website).

Some useful links: Scraper?, Web Crawler?

 

 

 


3 thoughts on “Scraping With Scrapy : Part 1”

  1. Hello Tapasweni,
    Your blog is very helpful and I am very excited to follow along and see if it works.
    I am new to Python and trying to learn scraping on Windows 7. But when I tried your first step, i.e. sudo apt-get install python-dev, I got an error saying sudo is not recognized. When I searched, I learned that it is a Linux command. Can you tell me how to do this step with pip?
    (No luck when I tried pip install python-dev.)
