How to Scrape Website Content

While this tip might be a little more advanced, it shows how easy it is to “scrape” content from websites into your spreadsheets. We’ll be relying on the fact that HTML is a form of XML (strictly speaking, XHTML is a valid form of XML, but for all intents and purposes we can assume HTML is close enough).

  • Go to the website you want to scrape content from. In this example, we’ll be taking data from a person’s Twitter stream, for example http://twitter.com/brickandrews.
  • By looking at the page source, determine which HTML element contains the content you want. For example, the Twitter updates on my Twitter page are found in span elements of the class “entry-content”.

  • In your Google Docs spreadsheet, enter the following formula:
=importXML("http://twitter.com/brickandrews","//span[@class='entry-content']")
  • The spreadsheet will fill in with the content from the imported XML.

The importXML function allows us to scrape the content from my Twitter page. It takes two arguments:

  1. The URL of the XML you want to import (in this case, it’s the HTML from the page at http://twitter.com/brickandrews!).
  2. An XPath query that specifies what part of the XML you want to import. This is the trickiest part. In my example, the XPath simply matches any span element with a class attribute equal to “entry-content”. It’s easy enough to hack around with the XPath (a few more examples follow right after this list), but you might also like reading up on how to create XPath queries.
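
To get a feel for the XPath syntax, here are a few more formulas you could try. The site and element names below are placeholders for illustration; inspect your own page’s source to find the real ones:

  • =importXML("http://example.com", "//a/@href") returns the URL of every link on the page.
  • =importXML("http://example.com", "//h2") returns the text of every h2 heading.
  • =importXML("http://example.com", "//div[@id='main']//p") returns every paragraph inside the div whose id is “main”.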

Because my Twitter page is constantly updated, the spreadsheet will continually update to reflect this “live” data. Of course, web page content isn’t the only kind of XML you can import using the importXML function. Many web apps (like Twitter, Google Apps, and more) provide XML-based APIs that you can import from using the same method described above; one possible example follows below. If you’ve got an interesting example, please share it in the comments!
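
For instance, at the time of writing Twitter exposes search results as an Atom feed, so a formula along these lines should pull in recent tweets on a topic (the feed URL is my assumption here; check Twitter’s API documentation for the current one). Because Atom declares a default namespace, the XPath needs the local-name() workaround:

=importXML("http://search.twitter.com/search.atom?q=%23apple", "//*[local-name()='entry']/*[local-name()='title']")

If all you want are the feed items, the sibling importFeed function is simpler and skips the XPath entirely:

=importFeed("http://search.twitter.com/search.atom?q=%23apple")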

As always, you can check out the spreadsheet used in this example.


11 Responses to “How to Scrape Website Content”

  1. Michele says:

    Interesting. Why not just copy and paste and give props to the author?

  2. admin says:

    Hi Michele, the problem with copy and paste is that you’d only get a one-time snapshot of the website content in your Google Docs spreadsheet. By using the method here, your spreadsheet will reflect the most current content on the site you are “scraping”. I agree with you: if you are going to share your spreadsheet, it would probably be nice to acknowledge the source(s).

  3. mauren says:

    Just watching your vid from appsumo, really nice.

    RE: scraping website content as per this page.

    It only works for me when I use =importXML("http://twitter.com/brickandrews";"//span[@class='entry-content']") with “;” instead of “,”.

    And why does CNN not work but for example bol.com does?
    importXml("http://www.cnn.com"; "//a/@href")
    importXml("http://www.bol.com"; "//a/@href")

    Amazon is also not working for me: =importXml("http://www.amazon.com/gp/product/B000VECAEE"; "//b/[@class='priceLarge']")

    Please note that I’m from the Netherlands, so there might be different versions of Docs?

    • darling says:

      Could it have something to do with a limit on requests within a time window? Web services tend to keep a cap on things.

  4. Erica says:

    Hi there,

    I am looking at your source code, however your content appears to be in

    not entry-content.

    Your formula still works. Am I missing something? Thanks!!!!

  5. Greg says:

    Is there any way to get this to work on an HTTPS site that requires a login? I followed these steps to attempt to scrape account balances from my bank, but even if I am already logged on to the site it still won’t work. Google Docs returns an error telling me that the data cannot be retrieved and to check the URL.

  6. admin says:

    Hi Greg, you may not be able to use the ImportXML method when authentication is required. In general, the ImportXML function in a Google Docs spreadsheet cannot process data that is created in a two-step process: for example, when an authentication token must be retrieved before making the URL request, or when the URL tells the server to dynamically generate an XML output and then redirects the user to that output, even though the URL stays the same. You might want to look into Google Apps Scripts (http://code.google.com/googleapps/appsscript/index.html) to handle this case; a rough sketch of the idea follows below. I hope this helps. BTW, creating a script to handle this situation might be a good topic for a new post. Stay tuned!
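
    To give a flavor of the Apps Script route, here is a minimal sketch. It assumes the site accepts plain HTTP Basic authentication; the URL and credentials are placeholders, and a real bank will almost certainly require a token or redirect flow on top of this:

    function importProtectedXml() {
      // Placeholder URL and credentials: substitute your own.
      var url = "https://example.com/account.xml";
      var options = {
        headers: {
          // HTTP Basic auth header; many sites need a token flow instead.
          "Authorization": "Basic " + Utilities.base64Encode("user:password")
        }
      };
      // Fetch the page with the auth header attached.
      var response = UrlFetchApp.fetch(url, options);
      // Drop the raw response into cell A1 of the active sheet for inspection.
      SpreadsheetApp.getActiveSheet().getRange("A1").setValue(response.getContentText());
    }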

  7. RMB says:

    Can you use this method to search a topic on Twitter, like Apple? This doesn’t work:

    =importXML("http://twitter.com/#!/search/aaple","//span[@class='entry-content']")

    In other words, I want to pull in tweets about a subject or hashtags like #apple. Thanks!

  8. chris says:

    I am having trouble understanding XPath; I have reviewed the pages referenced and more, but I’m still having trouble. I am trying to pull weather data into a Google doc and then manipulate it. The XML file is
    http://forecast.weather.gov/MapClick.php?lat=45.81732&lon=-119.98581&FcstType=dwml (from the HTML page, http://forecast.weather.gov/MapClick.php?lat=45.817315080406246&lon=-119.98580932617188&site=pqr&unit=0&lg=en&FcstType=text). Any basic XPath suggestions on pulling data into a Google doc from this page would give me a better start.
    Thanks!!
    Chris

  9. DJ says:

    How would you go about creating a two step process for authentication to pull my bank account balance automatically into my google docs budget spreadsheet?
