Wednesday, 22 January 2014

Scraping Fund Data off the Internet - an Orgy of Different Tools

Previously, I've used a combination of Bash and curl to scrape pages with fund statistics from investment sites, and then Perl scripts to data-mine the scraped pages and analyse the data. That worked well: small, customized, targeted Bash and Perl scripts.

Recently, however, I also wanted to use the Norman value (an estimate of how much a fund will cost over 10 years of keeping it in an investment portfolio). The only source for these values was one popular fund comparison site, so I set out to Bash and curl my way to a loop that would scrape the data off it - only to run into serious trouble...

Using the Firebug plugin in Mozilla Firefox, the built-in Developer Tools in Google Chrome, and eventually resorting to the Charles web debugging proxy to really see what was going on, I was able to pinpoint exactly what my browser sent to the site when I navigated the pages - but I was unable to replicate the requests in my scripts. Unfortunately, the site relied far too heavily on JavaScript dynamically building the form data posted with every request. So I had to think of something else.

Enter Selenium, the web automation tool. Since it uses an ordinary browser to do the surfing, dynamic JavaScript mumbo-jumbo is no match for it. But I still wanted to script my usage, and eventually chose Watir for the job, once again being surprised by how much of a joy it is to program in Ruby!

So, by writing a short Ruby program that uses the Watir framework to drive Selenium, I was able to get the data I wanted from the site in question.

However, I began by restricting the funds to just those of the bank I am a customer of (it was my funds there that were most important to re-evaluate and perhaps exchange for others at the turn of the year). When, for kicks, I tried to scrape all of the site's 2,500+ funds, I ran into new problems. The site was simply not stable enough. Now and then, one of three things would happen: the script would end up on the very first fund presentation page with a subtle "An error occurred" message; it would land on a full-fledged "An error occurred" page; or it would seemingly be on the right track, getting to the next of the 100+ pages of 20 funds each, only for me to realize that the site had silently thrown it from the target tab back to the default tab. To battle this, I had to go heavy on error handling, turning my short and elegant Ruby script into a less nice collection of rescue blocks for this and that exception (some of which I defined and raised myself, for example when browser.text.include?("An error occurred") is true).
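To give an idea of what that error handling looks like, here is a minimal sketch in plain Ruby, not the actual script. The exception names, the check_page! and with_retries helpers and the "fees" tab name are all made up for illustration; in a real Watir script the page text would come from browser.text, while here a simple array stands in for the flaky site so that the example runs on its own.

```ruby
# Hypothetical exception classes; the names are illustrative only.
class SiteError < StandardError; end
class ErrorPageShown < SiteError; end
class WrongTab < SiteError; end

# Raise if the page text contains the site's error message, or if the
# site has silently switched away from the tab we expect to be on.
def check_page!(page_text, active_tab, expected_tab)
  raise ErrorPageShown, "site reported an error" if page_text.include?("An error occurred")
  raise WrongTab, "on #{active_tab}, expected #{expected_tab}" unless active_tab == expected_tab
end

# Run the block, retrying on SiteError until max_attempts is reached.
# The attempt number (starting at 1) is passed to the block.
def with_retries(max_attempts: 3, pause: 0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue SiteError
    raise if attempts >= max_attempts # give up and re-raise the last error
    sleep(pause)
    retry
  end
end

# Stand-in for the flaky site: the first attempt shows the error
# message, the second one returns a (made-up) fund row.
pages = ["An error occurred", "Some Fund  1.2%"]
row = with_retries do |attempt|
  text = pages[attempt - 1]
  check_page!(text, "fees", "fees") # "fees" is a hypothetical tab name
  text
end
puts row
```

Keeping the checks in one place and the retry loop in another at least stops the rescue blocks from spreading all over the scraping code.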

All in all, quite an educational and rewarding exercise.