Web Scraping of Psychological Data with Scrapy

This tutorial accompanies an article:
Landers, R. N., Brusso, R. C., Cavanaugh, K. J., & Collmus, A. B. (2016). A primer on theory-driven web scraping: Automatic extraction from the Internet for use in psychological research. Psychological Methods, 21, 475-492.
You can also cite this tutorial as:
Landers, R. N. (2016). Web scraping of psychological data: Getting started with Scrapy. Retrieved from http://rlanders.net/scrapy/

List of Steps and Shortcuts:

  1. Installing Software
  2. Creating a New Scrapy Project
  3. Understanding the Folder Structure
  4. Building a Scraper to Harvest Data from a Webpage
  5. Building a Crawler/Spider to Scrape Multiple Webpages
  6. Saving Time During Debugging with Scrapy Shell
  7. Sample Project
  8. Sample Project Solution
  9. More Complex Spiders and Scrapers
  10. Accessing Databases through an API

Installing Software

  1. Install Python 2.7.x as the base programming language in which you will be writing: https://www.python.org/downloads/ – when doing this, ensure you check the box to add Python to your command path
  2. Install PyCharm Community Edition as your Python IDE (integrated development environment – this is software that gives you feedback on the quality of your code as you write it and helps track files, similar to R Studio when used with R): https://www.jetbrains.com/pycharm/
  3. At this point, if programming Python is not something you know how to do, it is recommended (but not required) to learn Python independently of this tutorial. This is not as difficult as it might seem now, although it does take some time. We recommend the interactive tutorial provided by Codecademy, which you can find at https://www.codecademy.com/learn/python. You can use PyCharm as your IDE for the Codecademy tutorials.
  4. Once installed, open PyCharm, and create a blank project. Ensure this project is saved in a base folder where you will keep all of your scraping projects (e.g., create it in C:\ScrapyProjects\).
  5. Once the blank project is open, on the bottom left corner of the PyCharm interface, click the menu button and select Python Console so that Console and Terminal windows appear at the bottom.
  6. Click on Terminal to select it. Here, you will need to use the PIP software (“pip installs packages”) which came with Python to install software packages that extend Python’s capabilities beyond its core functioning (same principle as packages in the statistical software R). Run the following command in the Terminal window and wait for it to complete successfully.
    pip install scrapy
    If this command throws a “file not found” error, you probably didn’t check the box to add Python to your command path. There is a way to fix this problem, but if you aren’t fluent in the Windows command line, it is probably easier to uninstall Python at this stage and start the installation process over again.
  7. Once Scrapy is installed, enter the following command in the Terminal:
    scrapy
    You should see about 20 lines of information, the first of which (you might need to scroll up) will give you the version of Scrapy that is installed and should look something like this: scrapy 0.24.5 – no active project. Note that for this tutorial, we are using Scrapy 0.24.5. If the version you use is newer/different, some of this tutorial may not work exactly as expected, in which case you'll need to figure out why, which will probably require some familiarity with Python. If you see errors, you will need to fix them before continuing. The easiest of these is if you see something like "ImportError: No module named modulename". If you see this, go to the Terminal window and type:
    pip install modulename
    Note that there is a bit of a rabbit hole here. If you try to pip install modulename and see another ImportError, you’ll need to pip install that module first, then go back and install the original missing module. Follow this pattern until “scrapy” runs without errors.

Creating a New Scrapy Project

  1. Once everything is installed correctly (no error messages), in the Terminal, type the following (replacing projectname with whatever you want the folder name to be for your first project – I suggest tutorial – but it can be any term that is all one lowercase word with no special characters or spaces):
    scrapy startproject projectname
    This will create a new folder inside the folder you created before (e.g., if you were in C:\ScrapyProjects\, this would create C:\ScrapyProjects\projectname) inside of which will be a template for a Scrapy project. For the remainder of these instructions, I'll refer to your currently open project (whatever you named it above) as projectname.
  2. In PyCharm, close the blank project you’ve had open (File menu/Close Project). Click Create New Project and under Location, choose the folder you just created (in our example, c:\ScrapyProjects\projectname). When you are asked if a project should be created based upon the files already in those folders, say Yes. If all is successful, you should see a folder in the Project space (top left child window) called projectname which contains another folder called projectname and a file called scrapy.cfg. Expand the projectname folder and the spiders folder it contains to see all the files in your project.
  3. Double-click items.py, settings.py, and pipelines.py to open them as well.
  4. In settings.py, you will notice a line starting “#USER_AGENT”. Remove the # and replace the text inside the single quotes with a string that represents you. This is how Scrapy will identify itself to websites that you harvest. I recommend something like the following:
    USER_AGENT = 'Big State University research project (myemail@bsu.edu)'

Understanding the Folder Structure
The files created in the above process contain different sets of commands that Scrapy will execute in order to run your scraping project, grouped by purpose. Broadly, in the order you will edit them, they are the following:

settings.py provides settings that are consistent across your entire project that Scrapy needs to access. You’ve already made a small change to this file.
projectname_spider.py, which you will create later, specifies how Scrapy will identify which webpages to download (crawling/spidering) and what to grab from them (scraping).
items.py specifies a list of the variables you want to extract. Ultimately, these will be the column names in your dataset.
pipelines.py specifies the “item pipeline”, which provides instructions on how each webpage will be processed to extract the data it contains (scraping).

You will send commands directly to Scrapy through the Terminal window you used before, which will in turn access these files to figure out what it should do.
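
For reference, the layout produced by scrapy startproject (plus the spider file you will create in the next section) typically looks something like the sketch below; exact contents can vary slightly by Scrapy version:

C:\ScrapyProjects\projectname\
    scrapy.cfg                      (project configuration file)
    projectname\                    (your project's Python module)
        __init__.py
        items.py                    (variable/column definitions)
        pipelines.py                (item pipeline)
        settings.py                 (project-wide settings)
        spiders\
            __init__.py
            projectname_spider.py   (crawling and scraping code; created later)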

Building a Scraper to Harvest Data from a Webpage
For this example, we will be scraping complete citations, author lists, and years of publication from tntlab.org. It is recommended you do the same for the first time you try to follow these instructions.

  1. Even if you ultimately plan to harvest data from multiple webpages, you will want to begin by testing your scraper on a single webpage. Once you are confident that your scraper works correctly, you will attach it to a crawler. We do it in this order because scraping a single webpage takes less than a second, whereas crawling an entire website a) takes a second or two per page and b) causes a web traffic and electricity cost spike for the website you are crawling. Since you want to put as little load on the server as possible ("crawl responsibly"), it is best practice to perfect your scraper before employing your crawler.

    To accomplish this, we're going to first automatically generate a "simple" crawler to serve as a placeholder while we build our scraper. To do this, open the Terminal and type the following:
    scrapy genspider projectname_spider tntlab.org
    If successful, you will see a confirmation ("Created spider") in the Terminal. After a moment, you will also see a file (projectname_spider.py) appear in the spiders folder of your project. Double-click that file to open it in the editor.
  2. The first section of projectname_spider.py (beneath the class declaration but above the def command) contains instructions explaining to Scrapy how to crawl. Right now, it is being instructed to visit only http://tntlab.org, so that is the only webpage it will look at. Eventually, we will explore how to change this file to spider across many more webpages. But for now, we will harvest data from the front page of tntlab.org only. At this point, also be sure that there is a line that looks like this:
    name = "projectname"
    If it doesn’t look like that, change it now.
  3. Next, open items.py. This file gives Scrapy a list of all the variables you ultimately want in your dataset. As specified above, we want three distinct pieces of information from each citation on tntlab.org: the complete citation itself, the list of authors for that citation, and the year of publication. Because there are three items we want, we will delete the word pass and replace it with three new lines:
    citation = scrapy.Field()
    authors = scrapy.Field()
    year = scrapy.Field()

    There are certain naming conventions for variable names in Python which you should follow. First, do not use capital letters. Second, do not use the name of a command that already exists in Python; you can tell you have done this if the word changes colors after you type it. If that happens, change it to something else!
  4. Now, return to projectname_spider.py. The second part of this file, contained under def parse(self, response):, is where you will write code for your scraper. Within this section, you need to instruct Scrapy which items to scrape, so you will need to write a little code. Pointers about this code:
    a) Remember to match capitalization throughout. Selector and selector are not the same code.
    b) Note that Projectname now starts with a capital letter and is followed by Item, which is also capitalized. projectnameitem, Projectnameitem, projectnameItem, and ProjectnameItem are all different code.
    c) When you see indentation, either press the space bar four times exactly or press the tab key once. No other combination will work. To actually create your starter instructions, delete pass and replace it with the following code:
    pagecases = scrapy.Selector(response).xpath('')
    items = []
    for case in pagecases:
        item = ProjectnameItem()
        item['citation'] = ''
        item['authors'] = ''
        item['year'] = ''
        items.append(item)

    return items
    At the top of the file, also add the following code below any import or from statements (probably below “import scrapy”):

    from projectname.items import ProjectnameItem
    from lxml import html
    import re
    
  5. To complete this code, we will need to dive into the source code of tntlab.org's front page. This will require some basic knowledge of HTML and XPATH. Generally speaking, HTML is the base language of the web; how it is written determines how webpages are structured. HTML is structured such that any webpage can be expressed as a series of nested virtual objects. For example, on tntlab.org, an HTML object contains 100% of the HTML that makes up that page. The HTML object contains a BODY, which contains all visible information on that page. The BODY object contains several DIV objects that specify the major divisions of information on that page. One of those DIV objects contains a series of P (paragraph) objects that represent each citation we're trying to scrape. To scrape the citations we want, we need to explain to Scrapy precisely which P objects we're after in a way that is consistent on every page we want to scrape. That is what needs to be done in code.

    If you are unfamiliar with HTML, it is recommended you pause this tutorial and complete the following: http://www.w3schools.com/html/default.asp

    If you are unfamiliar with XPATH, it is recommended you pause this tutorial and complete the following AFTER completing the HTML tutorial: http://www.w3schools.com/xpath/
  6. To gather the information we need, we will need to develop one major XPATH that will specify the highest level of information that is consistent across everything we need to extract. In this case, we need an XPATH that points to the citation. Next, we will specify one XPATH for each variable we're trying to extract within that larger XPATH.

    The easiest way to figure out the XPATH for a particular element, in my experience, is using Google Chrome's Developer Tools, which are included in Chrome. Chrome handily includes an "element inspector" that will highlight the source code associated with an element when you hover the mouse cursor over it (or highlight the page when you hover over the code). To access the inspector, open the Chrome Menu and select More Tools/Developer Tools. Then right-click on an element you're trying to hunt down and select Inspect Element. If you do this for a citation and scroll around the code view, you'll spot the pattern: every citation we need is inside a P element with class="citation". Now we just need an XPATH.

    If you need help deriving the XPATH, you can right-click on one of the elements you want and select Copy XPath to get the XPATH of that particular element. It is recommended you look at the XPATH of several of the data elements you're trying to extract to help identify the pattern that ties them all together. You can also press CTRL+F to open the search window and type in an XPATH; you'll see the number of matches as the results of your find. Using these tools, once I've figured out the XPATH that will generalize to everything I need, I can change one line of code, which defines an internal variable called pagecases:

    pagecases = scrapy.Selector(response).xpath('//p[@class="citation"]')

    Note that we cannot use a simple ‘p’ because that would match other non-citation paragraphs.

    You can conceptualize pagecases as a collection of cases (in the statistical sense) that appear on one webpage. In the next step of writing our scraper, each collection of pagecases will be broken down into many individual cases, each of which will become a new row in your ultimate dataset. So if your unit of analysis is "blog post", there should be one pagecases XPATH for all blog posts appearing at a single URL (one webpage).

    I will warn you ahead of time that deriving appropriate XPATHs to capture everything you want but nothing you don’t (high signal-to-noise ratio) is the second most challenging part of developing a scraper, so spend some time getting the pattern right. Every scraping project is just a little bit different.

  7. Now that we have identified the piece of information we want, it’s time to run a test to see if we’re actually getting what we think we’re getting. To do this, we’re going to temporarily place the data being extracted into one of our variables – in this case, citation. Update the item[‘citation’] line to the following:
    item['citation'] = case.extract()
    What this line does is instruct Scrapy to convert the case (remember there are multiple cases in pagecases) back into HTML. Right now, it's in a format only Scrapy understands. Once you've completed that step, save and jump back down to the Terminal and enter the following command:
    scrapy crawl projectname -o test1.csv
    After a few seconds of processing, you will see test1.csv appear in your project view, and you'll also be able to open that file in Excel. Do so now. If everything went well, you should see a bunch of <p class="citation"> and not much else. Note that projectname in this case is actually the "name" you specified in projectname_spider.py. This way, you can have multiple spiders within the same overall project, if that is something you end up needing.
  8. Now that we’ve extracted the raw HTML successfully from tntlab.org’s homepage, we want to scrape pieces of information that are a bit more meaningful. The simplest case is the citation itself – we want that citation, but we don’t want all the HTML tags in it. To do so, we will strip out those tags after getting the extracted case, then we’ll get rid of any leftover whitespace. Modify your item[‘citation’] line to the following:
    item['citation'] = html.fromstring(case.extract()).text_content().strip()
    What we’ve done here is apply some Python functions to the result of case.extract(). html.fromstring converts the plain text that our scraping created into a Python-interpretable object, then text_content() strips out the HTML tags. Finally, strip() removes any preceding or trailing whitespace (new lines, blanks, tabs). Save and drop back to Terminal and run the following:
    scrapy crawl projectname -o test2.csv
    If you open the resulting file in Excel, you’ll see a nice list of citations.
  9. Next, we need to extract two specific pieces of data from this string: the authors and the year. This is the most difficult part of developing a scraper. This will require regular expressions, commonly called regex. If you aren't familiar with regex, it is recommended you pause this tutorial and complete this one: http://regexone.com/ followed by http://www.tutorialspoint.com/python/python_reg_expressions.htm for a bit more detail on how to implement regex in Python. The most common hangup: % becomes \ (e.g., %d is now \d). To extract the authors and years, change those lines to the following:
    item['authors'] = ", ".join(re.findall(r'^(.*)\(\d{4}.*\)', item['citation']))
    item['year'] = ", ".join(re.findall(r'\((\d{4}).*\)', item['citation']))

    It will be helpful to understand these regexes piece by piece, and we'll concentrate on the year. First, this line runs re.findall, which is a function that searches for a particular regex and returns everything it finds as a list. Here, findall searches for r'\((\d{4}).*\)' within item['citation']. That string returns four digits located within two parentheses, and it allows text to follow between the four digits and the last parenthesis. This is required to account for presentation citations, e.g., (2014, May). Once the list of dates has been identified by re.findall, .join() attaches them to each other as a single line with ", " between the pieces. For example, if re.findall ended up producing a list containing 2004, 2005 and 2006, the data that would end up in the file would be "2004, 2005, 2006". If only one year was found, it would print only that year. If no years were found, it would print a blank. One year is the target outcome, so anything else is a problem we'll need to fix.

    To a degree, the better job you do defining your XPATH, the less work you will have defining your various REGEXs (i.e., less "noise").

    A convenient way to test your regex expressions before running the entire scrapy program is to drop down to the Console (the button is next to the Terminal) and use the print command. For example, to test our scrape of year, we might do this:
    print ", ".join(re.findall(r'\((\d{4}).*\)', "Author &amp; Author. (2006, Jan). Words"))
    This way, we can see the output of our test case before running the entire scrapy program we’ve written. In this case, “2006” will output to the console.

    With your REGEX complete, run your program in Terminal and open the resulting file in Excel:
    scrapy crawl projectname -o test3.csv

  10. The final step in building a scraper is debugging. When you open test3.csv, you'll notice that a few lines of code didn't work as intended: any citation that is "in press" or otherwise doesn't match the expected pattern (four digits plus some text) hasn't been caught correctly. We could fix this with even more complicated REGEX, but instead, we'll handle it by expanding this section a bit, to demonstrate what kind of processing you can do in Python. Replace the item['authors'] and item['year'] lines with the following (remember to use tab or 4 spaces for indents):
    if "in press" in item['citation']:
      item['authors'] = ", ".join(re.findall(r'^(.*)\(in press\)', item['citation']))
      item['year'] = "in press"
    else:
      item['authors'] = ", ".join(re.findall(r'^(.*)\(\d{4}.*\)', item['citation']))
      item['year'] = ", ".join(re.findall(r'\((\d{4}).*\)', item['citation']))

    This code checks to see if the phrase "in press" is anywhere in the citation; if it is, it creates the variables one way; if it isn't, it creates them the way we did originally.

    Always choose the scraping technique that 1) makes the most sense to you, and 2) takes the least amount of time while 3) handling all the cases you need accurately and completely. Programmers have a tendency to build their software to handle absolutely any possible variation it might encounter, in as simple code as possible. When you're building a single dataset that you're going to export to another program (like SPSS), this is unnecessary. You need the accuracy level you need and no more. Don't waste time making it more robust unless you plan to use this same scraper in the future.

    Now that you have a complete scraper, try one more time in Terminal:
    scrapy crawl projectname -o test4.csv
    In Excel, you should see a perfect dataset, scraped from tntlab.org. That means your scraper is finished, for now. Normally, you would at this point test a few different pages to see if any new exceptions crop up. Once you’re confident you can handle any sort of data that gets thrown at you, it’s time to build the crawler.
  11. Other points of note as you design your own scrapers:

    REGEX can be used inside of XPATH. Use whatever combination of XPATH and RE that makes sense for your particular project. Remember that the result of the initial XPATH should be as narrowly defined as possible while still including all of your target variables.

    Well-designed and logically consistent websites are much easier to scrape than badly maintained and flawed websites. If you run into many flawed websites (e.g., sometimes a tag is used to indicate the data you need and sometimes it isn't), you may need to write more complex code to identify the parts you want. This may require learning a significant amount of Python. I suggest you start here: http://www.codecademy.com/en/tracks/python

    For those already familiar with Python, you might wonder why I used findall() instead of match() or search(). The reason is simple: findall() always outputs a list, whereas match() and search() output a NoneObject if they don't find anything. That means I would need to add code to handle a NoneObject. That makes less sense to me than findall() and join(), but if it makes more sense to you, you should do that instead. Remember: there are many ways to solve a programming problem and rarely a "best" way.

    If you don't want to paste all of the URLs directly into your spider, there are alternative approaches to read from a CSV file (called CSVFeedSpider) or an XML file (called XMLFeedSpider), which you can read about in the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/spiders.html

    Although you can scrape any information you want using REGEX, it might be useful/easier to use XPATH. For example, your case might be defined by a DIV with several other elements within it. In these cases, you can search within the case using an XPATH. For example, I could extract only the text within the P element within a case with: case.xpath('p/text()')
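
    As a brief illustration of combining the two approaches, here is a hedged sketch (not part of the finished scraper above) of Scrapy's .re() method, which applies a REGEX directly to whatever a Selector matched. It would go inside the for loop over pagecases, where each case is one citation paragraph:

    years = case.re(r'\((\d{4})')            # REGEX applied to the case's extracted HTML; returns a list
    links = case.xpath('a/@href').extract()  # XPATH alone, pulling any link targets out of the case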

Building a Crawler/Spider to Scrape Multiple Webpages

  1. To build a crawler, you must first identify how the pages you want to harvest are linked together, from a technical standpoint. The easiest scenario is when all of the pages you want are linked by a common URL structure. For example, a blog might have this structure:
    http://website.com/01/02/2014/title_of_the_first_post.html
    http://website.com/05/22/2015/title_of_the_second_post.html
    http://website.com/08/31/2016/title_of_the_third_post.html
    http://website.com/04/01/2017/title_of_the_fourth_post.html

    In this case, there is a common pattern: http://website.com/xx/xx/xxxx/title

    You would then specify in projectname_spider.py that only links fitting this pattern should be followed, using REGEX (a sketch of such a Rule appears at the end of this list).

    In cases where an API (application program interface) is available, you would access data through that API instead of harvesting webpages directly. For example, Twitter and Facebook both provide APIs. When spidering, you would cause a great deal of overhead and cost to Twitter or Facebook for accessing hundreds of thousands of pages of content, and you could also be locked out of their pages once they detected you were doing it, since scraping directly is forbidden by the documents you agreed to when signing up for those websites.

    The API provides a “back door” to pull data directly from the database running their website without access to the website itself. In many cases, APIs also enable you to download data that are not easily scrapable. For example, Facebook replaces day/time information on its posts with user-friendly shorthand, like “about an hour ago.” When scraping data, you would not want “about an hour ago” in your dataset – you would want an actual date and time. The API enables you to get that information. We will discuss APIs in a later section. For now, we will explore scraping when an API is not available.

  2. The simplest case here is a basic spider. This is needed when you have a well-defined set of fewer than 20 or so webpages. We'll do this editing in projectname_spider.py. In this case, it is generally faster to copy/paste the URLs into the start_urls variable directly. For example:
    start_urls = (
      'http://urlno1.com/',
      'http://urlno2.com/',
      'http://urlno3.com/',
    )

    Note that there are quotes surrounding each URL and a comma at the end of each line. When you run this crawler, each page will be loaded sequentially and the scraper you wrote earlier will be executed on the results of each page (each webpage creates a new instance of pagecases).
  3. The more complex case occurs when you need to use information contained on each page to identify other pages to scrape. For example, in an online discussion list, you would want to collect every post available – potentially thousands or tens of thousands of pages. This requires a specific type of spider called a crawler, so called because it crawls across the web, extracting information from each page it sees to identify other relevant pages and then crawl over to them too.

    Where crawlers get tricky is telling them when to stop. If you were just to follow every link found on every page of a website, you would eventually collect the entire internet – in fact, this is precisely what Google does with their crawler, the Googlebot. For a research project, that kind of behavior is much less useful. Fortunately, Scrapy provides several tools to make crawler design quite easy.

    Up to this point, we've been using the simplest case, called Spider. For this example, we will convert the spider to a crawler by making two changes. First, right beneath "import scrapy", add the following three lines:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor
    from projectname.items import SpideringItem

    Next, change the "class" line to the following (note that the entire expression "scrapy.Spider" has been changed to "CrawlSpider"):
    class ProjectnameSpider(CrawlSpider):
    If you run your project now, you will get the same output you got before with scrapy.Spider. The reason is that CrawlSpider is actually a type of Spider, so most of the same commands should work. In practice, you could always specify a CrawlSpider to get the functionality of Spider.
  4. Now we are set up to build a crawler. Crawlers generally start from a single webpage and crawl based upon the information it contains. We'll build a variety of crawlers to demonstrate the power of Scrapy's crawling tools. I'm going to do this in the same file as the spider and scraper we already built, but you could alternatively create a brand new project and do it there if you want to preserve your work. Up to you.

    For this first crawler, we'll harvest every page on an entire domain that we can find a link for. In general, you do not want to do this for actual scraping projects, but it provides a conceptual starting point for building the rest of the crawler.

    To build our new crawler and scraper, we will first disable the scraper we wrote earlier. The easiest way to do this without deleting existing code is to highlight the entire scraper block (from def parse down to return items) and press CTRL+/. This will "comment out" every line by adding a # in front of it, which instructs Python to ignore whatever it sees there. This is also a good way to provide reminders to yourself as you code. You should end up with something like:
    # def parse(self, response):
      #
      .
      .
      .
      # items.append(item)
    # return items

    Whenever you use CrawlSpider, you should not have a parse() function defined. Next, we’ll create our crawler and scraper. Beneath the start_urls = () section, add:
    rules = (
      Rule(LinkExtractor(allow=('', )), callback='parse_crawled', follow=True),
    )
    
    def parse_crawled(self, response):
      pagecases = scrapy.Selector(response).xpath('//h3')
      items = []
    
      for case in pagecases:
        item = SpideringItem()
        item["url"] = response.url
        item["content"] = html.fromstring(case.extract()).text_content().strip()
        items.append(item)
    
      return items

    When you do this, remember to keep the indention level consistent; both the rules and parse_crawled() declarations are within class ProjectnameSpider, so both need to be indented one level in.

    If you take a moment and try to figure out what the scraper does, you should conclude that it extracts all the content under H3 headings and spits out two items: url, which will contain a url, and content, which contains the text of the H3.

    There are a few things to notice here. One, the function is not called parse(). The reason for this is that parse() is always called, for every page found. When we crawl, we don't necessarily want to scrape every page, so we want more control than this. So instead, I define my own function (called parse_crawled()), which contains scraping instructions. The parse_crawled() function is called every time the Rule() above is met. You can interpret the Rule line as meaning: every time this Rule is met, call the parse_crawled() function to process what you found (thus: callback).

    Two, if you followed the scraper tutorial, you’ll also notice that I’m using a different Item definition than I did the first time – instead of ProjectnameItem, I’m referencing SpideringItem. There’s nothing special about this name. In fact, now we need to define it. To do that, go back to items.py and add the following new block:

    class SpideringItem(scrapy.Item):
      url = scrapy.Field()
      content = scrapy.Field()

    Although ProjectnameItem is still defined here, we’re not going to use it in our new crawler/scraper.

    Drop to Terminal and run scrapy on this new crawler/scraper:
    scrapy crawl projectname -o test5.csv
    Open in Excel, and if successful, you should see two columns: all the headings from the entire tntlab.org website in the second column and the page that heading came from in the first.

  5. Now that we have a barebones crawler, we can work through customization to build our second crawler. Remember that in "real" projects, you would probably not actually run your barebones crawler. Instead, you'd want to customize. For example, on a website hosting a discussion board that you were targeting, you would probably only want the discussion board and not all of the other pages on the website. To add this sort of customization, we will add a Rule.

    There are two major purposes to a Rule, which require slightly different approaches. The first approach, you've already taken: a Rule can be used to identify a page to be harvested (a "scrape" rule). The second approach, however, is to use a Rule only as a step along the path to identifying more pages (a "search" rule). In these cases, you don't want to harvest information for your dataset from the page, but you do want to harvest links to identify other pages.

    For our example, let's change the existing rule to a search rule. To do this, just get rid of the callback:
    Rule(LinkExtractor(allow=('', ))),
    With this rule, the crawler will collect links only (nothing for your dataset) from every page it can find within the allowed_domains list. If you run it now, you'll end up with a blank dataset. Let's say (for our example) that we only wanted pages with the letter 'a' in the filename, but we wanted to look at every page on the server for links. The code would change to this:
    Rule(LinkExtractor(allow=('tntlab.org/.*a.*php', )), callback='parse_crawled', follow=True),
    Rule(LinkExtractor(allow=('', ))),

    As you can see, we’ve used a REGEX to specify precisely the filenames we want. Run an export to test6.csv and see the difference: only files with “a” were scraped.

    Two important caveats:
    a) Scrapy runs through these rules in order, so as soon as a page matches a Rule, it won’t be checked against subsequent rules. That means you’ll want your most specific Rules first and your less specific ones later.

    b) You’ll notice that I have a “follow” flag in one of the Rules but not the other rule. The reason is a little odd and can lead to unexpected spidering: if “callback” is present, Scrapy will not follow links on the page that matches that rule. If I left off “follow”, any page matching the “a” pattern would not be searched, just scraped. Since I want to scrape and search for more links, I need to write follow=True. In the second Rule, there is no callback, so follow=True by default – you can write it anyway, but it won’t do anything.

  6. If you have a complex crawl where you need to exclude some pages that otherwise fit a pattern you've identified, you can also add another term called "deny", which also takes regular expressions. For our third crawler, we will look at every page on the server for links but only scrape pages with "a" in the filename as long as they do not also contain "ff". To do this, we change the first rule as follows:
    Rule(LinkExtractor(allow=('tntlab.org/.*a.*php', ), deny=('tntlab.org/.*ff.*php')), callback='parse_crawled', follow=True),
    Export this to test7.csv, and you'll find that data from crawled pages containing "ff" have disappeared. As you can see, Rules and LinkExtractors become increasingly complicated with more requirements. The key to building a crawler is using a combination of Rules and LinkExtractors that will find the pages you want but only the pages you want.
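
    Returning to the blog example from step 1, here is a hedged sketch of how that date-based URL pattern might be written as a pair of Rules. The website.com domain and the .html structure are hypothetical, and the callback assumes the parse_crawled() scraper defined above:

    rules = (
      # Scrape pages whose URLs match the /xx/xx/xxxx/title.html pattern...
      Rule(LinkExtractor(allow=(r'website\.com/\d{2}/\d{2}/\d{4}/.*\.html', )), callback='parse_crawled', follow=True),
      # ...and follow every other link on the site to keep discovering posts
      Rule(LinkExtractor(allow=('', ))),
    )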

Saving Time during Debugging with the Scrapy Shell
Up to this point, we’ve been running the full Scrapy platform each time we wanted to test our program. However, once you have a feel for how Scrapy works, and assuming you know a little bit of Python, it’s faster to use the Scrapy Shell. To use the Scrapy shell on your current project, drop to Terminal and type:
scrapy shell "http://www.tntlab.org"
This will fetch the tntlab.org website and hold it in memory so that you can run commands on it in real time. This way, you can test many different formats of XPATH or REGEX without re-downloading the file multiple times. Let’s step through some examples.

  1. First, enter the following command to ensure that tntlab.org was fetched correctly. You should see a copy of the page you downloaded in your web browser.
    view(response)
    
  2. Next, we’re going to manually enter the commands from our parse() or parse_crawled() functions to see if everything works as we expect. Start by entering all of the needed import and from commands at the top of your spider. To test commands starting with scrapy, html, and re, that means we’ll need:
    import scrapy
    import re
    from lxml import html

    These commands will differ depending on what you want to test. For example, if you wanted to test item creation, you could import the item declaration from projectname.items.
  3. Next, we’ll declare our own copy of pagecases. To do this, type the command exactly as you have it in your program. From our last example:
    pagecases = scrapy.Selector(response).xpath('//h3')
    To see what pagecases contains, simply type its name:
    pagecases
    You will see a shortened version of each element in the pagecases list. Note that this display does not show the entirety of the contents of pagecases – just the first several characters. You will also note that pagecases contains 3 Selectors, each of which contains other information. To isolate one of these cases for further study, type in the following:
    pagecases[0]
    In Python, lists are numbered from 0 up, so pagecases[0] selects the first item in the pagecases list.
  4. Next, let’s test out a few different XPATHs so that you can see how much easier it is to test your XPATHs in Shell. Type the following:
    scrapy.Selector(response).xpath('//p')
    scrapy.Selector(response).xpath('//h1')

    Each time, you will see what this command extracts with the new XPATH. Try out a few different XPATHs and see what happens.
  5. Once we have the XPATH we want, reassign it to the pagecases variable and then assign the first element to a “case” variable, in order to mimic the code you have in projectname_spider.py:
    pagecases = scrapy.Selector(response).xpath('//h3')
    case = pagecases[0]
    content = html.fromstring(case.extract()).text_content().strip()

    In our actual program, case is reassigned in a loop across everything within pagecases. But since we’re working with Python in real time, we will only look at one entry at a time. Notice that you could work with case instead of content. It’s up to you.
  6. At this point, we can test how our REGEX or other XPATHs work. Let’s say that we wanted to extract the first word of every heading. Try the following:
    content
    ", ".join(re.findall(r'^.*?\ ', content))

    First, we’re looking at content to see the raw case we’re working with. Second, we try out a regex to see what we get. In this case, we get almost what we want – but not quite. We’re grabbing both the first word and the space after it. Try a little tweak:
    ", ".join(re.findall(r'^(.*?)\ ', content))
    Perfect. At this point, we would copy our fixed REGEX back into our program above. In this way, Shell is a useful tool to see the impact of changing your XPATH and REGEX without waiting for the whole Scrapy program to run (and downloading the test page multiple times). Remember that you can also use Shell to test out the REGEX in your LinkExtractors before actually doing any crawling. When you're done with the Scrapy shell, exit with this command:
    exit()
  7. Shell can also be called mid-Scrapy. This is useful if there is a particular response pattern causing problems, but you are unsure how to find it manually. To do this, add the following somewhere meaningful in your code after response has been declared:
    from scrapy.shell import inspect_response
    inspect_response(response)

    When you run your program, the program will pause temporarily and allow you to run shell commands on the data that Scrapy is currently looking at.

Sample Project
To test that you’ve gained the skills covered in this tutorial, next try the following project. My code follows afterward, but don’t look at it until you’ve got your own answer:

Return a spreadsheet that does the following:

  1. Crawls throughout the "Research in Action" section of the APA website: http://www.apa.org/research/action/index.aspx. Use a Rule to capture the list of pages (don't just list the page URLs).
  2. Produces a dataset containing one line per external link provided by APA. For each line, collect a) the raw HTML that defines whatever you extract, b) the category (e.g., Aging), c) the name of the link, d) the link itself, and e) the description of the link (five variables).

Sample Project Solution
You probably found a solution that looks similar but not identical to mine. Remember – there are many ways to program. For example, in this case, I have opted to use only XPATH to collect data. You may have used REGEX.

To start, I entered in Terminal:
scrapy startproject apa
I then changed only two files:

In items.py:

import scrapy
class ApaItem(scrapy.Item):
  raw = scrapy.Field()
  category = scrapy.Field()
  link = scrapy.Field()
  url = scrapy.Field()
  desc = scrapy.Field()

In apa_spider.py:
from apa.items import ApaItem
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class ApaSpider(CrawlSpider):
  name = "apa"
  allowed_domains = ["apa.org"]
  start_urls = (
    'http://www.apa.org/research/action/aging.aspx',
  )
  rules = (
    Rule(LinkExtractor(allow=('apa.org/research/action/.*.aspx', ), ), callback='parse_crawl', follow=True),
  )

  def parse_crawl(self, response):
    pagecases = scrapy.Selector(response).xpath('//section[@id="ctcol"]/ul/li')
    items = []
    for case in pagecases:
      item = ApaItem()
      item["raw"] = case.extract()
      item["category"] = scrapy.Selector(response).xpath('//title/text()')[0].extract().strip()
      item["link"] = case.xpath('a/text()').extract()
      item["url"] = case.xpath('a/@href').extract()
      item["desc"] = case.xpath('p/text()').extract()
      items.append(item)
    return items
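
To try this solution yourself, you would run it from the Terminal the same way as the earlier examples, exporting the results to a CSV file (the filename is just an example):
scrapy crawl apa -o apa.csv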

More Complex Spiders and Scrapers
One of the tools not discussed in this tutorial is the Item Pipeline. Item pipelines allow you to exclude an item after it has been processed. For example, imagine that the information you want to extract is in a particular XPATH: //div/p. These P elements are the most specific XPATH you can get to. However, there are other //div/p that you don't want. In the item pipeline (pipelines.py), you can define which P to ignore after processing using DropItem and if statements. DropItem will eliminate whatever case you specify from your final dataset.

Every item you create in your spider will be sent to the item pipeline, so be careful not to drop anything unintentionally. You can also do more advanced item handling here; for example, imagine you were crawling a website that had two formats of pages. On one format of page, measurements were given in pounds, whereas in the other, in kilograms. The item pipeline could be used to convert one format to the other so that all output in your final dataset was in a consistent scale of measurement.
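
As a hedged sketch of both ideas, a minimal pipelines.py might look like the following; the pipeline class name and the field names (content, weight_lb, weight_kg) are hypothetical and would need to match fields declared in your own items.py:

from scrapy.exceptions import DropItem

class CleanupPipeline(object):
    def process_item(self, item, spider):
        # Drop any case that matched the XPATH but contains no usable content
        if 'content' in item and item['content'] == '':
            raise DropItem("Dropping empty case")
        # Convert pounds to kilograms so the final dataset uses one unit of measurement
        if 'weight_lb' in item:
            item['weight_kg'] = float(item['weight_lb']) * 0.4536
        return item

For Scrapy to run the pipeline, it would also need to be listed in the ITEM_PIPELINES setting in settings.py, for example: ITEM_PIPELINES = {'projectname.pipelines.CleanupPipeline': 300}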

Because Python is a programming language, you have ultimate control over exactly how Scrapy parses and processes your data. For example, you might have multiple rules that each call different parsing functions to handle different types of data differently. Carefully consider exactly what data you are trying to extract and what you want your dataset to look like when you’re done, and implement the code needed to make it happen.
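
For instance, here is a hedged sketch (the URL patterns, callback names, and the functions they point to are hypothetical) of a rules tuple that routes two different kinds of pages to two different parsing functions:

rules = (
  # Publication pages are scraped by one function...
  Rule(LinkExtractor(allow=('website.com/publications/.*', )), callback='parse_publications', follow=True),
  # ...and people pages by another, each building its own type of Item
  Rule(LinkExtractor(allow=('website.com/people/.*', )), callback='parse_people', follow=True),
)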

Accessing Databases through an API
Some websites cannot be scraped easily, and in some cases, scraping is forbidden by the terms of service of that website. In these cases, an API is often provided for you to send data requests directly to the website's databases. Such APIs are typically designed to give you access only to information that the company wants you to have, which is intentional. You may also be limited in the number of requests you can make in a given time period (e.g., 40 requests per 24 hours). Scraping such pages as we've covered it here may be illegal in your particular jurisdiction, so scrape carefully.

When an API is present, you are no longer "scraping" or "crawling." Instead, you are sending specific requests to the server and getting data back. Once the data comes back, you may need to format it in code to make it readable, and you can still do this in Python. However, you don't need Scrapy.

Because every API is different, you will need to look up the online documentation for the particular API you want to access using Python, learn what commands the API expects, and learn how to construct those commands for your purposes. Some APIs require registration; for example, the Facebook API will log in through your Facebook account, and you will only have access to data that your Facebook account has permission to access. So if you thought this approach would be a convenient backdoor to download every interaction from every person on Facebook, it is not. You will only have access to the data Facebook allows you to access.

The same is true for Twitter, but in practice, that only means you don’t have access to Private Twitter accounts (and there are not many of these, relatively speaking). However, Twitter appears to restrict access to its full database arbitrarily, so be careful that you are accessing what you think you are accessing.

When looking for API documentation, the developer in most cases provides some signposts to get you started. For example, Twitter provides the following information:

https://dev.twitter.com/overview/api/twitter-libraries
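
To give a flavor of what working with an API looks like, here is a minimal sketch using tweepy, a Python client library for the Twitter API. This is an illustration under assumptions rather than part of the tutorial: the credential strings are placeholders you receive when registering an application with Twitter, and the account name is hypothetical.

import tweepy

# Authenticate with the credentials Twitter issues when you register an application
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth)

# Request the 20 most recent tweets from one (hypothetical) public account;
# note that the API returns real timestamps rather than user-friendly shorthand
for status in api.user_timeline(screen_name='some_account', count=20):
    print status.created_at, status.text.encode('utf-8')

Compare this with the Scrapy workflow above: there is no crawling and no XPATH at all, because the API returns structured fields (such as created_at) directly.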