Using Threads For The First Time
November 30, 2013
If you haven’t read any of my previous entries, I’m currently working on a simple program to figure out who has the longest streak on GitHub. I’ve been making progress every day and now things are really starting to speed up.
404 Not Found (OpenURI::HTTPError)
The first problem I fixed was the 404 errors I was getting whenever a username didn’t match a GitHub profile page.
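require "open-uri"

# Fetch a user's profile page, or return "error" when GitHub answers with
# an HTTP error such as 404.
# URI.open needs Ruby 2.5+; older Rubies used the bare open(...) form.
def page(username)
  URI.open("https://github.com/#{username}").read
rescue OpenURI::HTTPError
  "error"
end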
Now I just catch the exception and return “error”. I could probably optimize this further so that I don’t try to parse the string “error”, but I think it is okay to leave it in there for now.
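Here is a rough sketch of what skipping the failed pages could look like. It assumes the fetch step returns nil on a failed lookup rather than “error”. The `parse_found_pages` helper and its `fetcher` and `parser` arguments are hypothetical stand-ins, not names from my actual script.

```ruby
# Hypothetical helper: fetch and parse each username, skipping failed fetches.
# `fetcher` returns the page HTML or nil; `parser` turns HTML into a result.
def parse_found_pages(usernames, fetcher:, parser:)
  usernames.each_with_object({}) do |username, results|
    html = fetcher.call(username)
    next if html.nil? # nothing came back (e.g. a 404), so there is nothing to parse

    results[username] = parser.call(html)
  end
end
```

Passing the fetcher and parser in as callables keeps the skip logic separate from the network code, which also makes it easy to try out without hitting GitHub.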
Using threads to parse profile pages
After what I learned yesterday about threading from the Ruby Rogues, I was able to use threads to parse all ~100 profile pages in each batch at the ‘same’ time instead of parsing them one by one. As you might guess, this sped things up dramatically.
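The general shape of the threaded version is something like the sketch below: one thread per username in the batch, then wait for all of them. This is a minimal illustration rather than my exact script. `parse_batch` is a hypothetical name, and the block stands in for "fetch and parse one profile page".

```ruby
# Hypothetical sketch: process one batch of usernames with one thread each.
# The block given to parse_batch does the per-user work.
def parse_batch(usernames)
  threads = usernames.map do |username|
    # Pass the username in as a thread argument so each thread gets its own copy.
    Thread.new(username) do |name|
      [name, yield(name)]
    end
  end

  # Thread#value waits for the thread to finish and returns the block's
  # result (re-raising any exception the thread hit).
  threads.map(&:value).to_h
end
```

The quotes around ‘same’ matter here: on MRI the global interpreter lock keeps threads from running Ruby code in parallel, but they can overlap while waiting on network I/O, which is where a scraper spends most of its time.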
Sequel::PoolTimeout errors
I guess one side effect of using threads is that I started getting Sequel pool timeout errors. I “fixed” this by increasing the pool_timeout value, but I’m not sure if that was really the best solution.
One thing I could do is store all the results in an array and then do one big insert of about 100 rows. I’m not sure how well that works for updating multiple existing rows at a time, though, in cases where I don’t need to do an insert.
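A sketch of the "collect first, insert once" idea, under some assumptions: the worker threads push their rows onto a thread-safe Queue, and only the main thread touches the database afterwards. `collect_rows` is a hypothetical helper, and the `:streaks` table name in the comment is made up for illustration.

```ruby
# Hypothetical sketch: worker threads only build rows; nothing here talks
# to the database. The block returns one row (a hash) per username.
def collect_rows(usernames)
  queue = Queue.new # thread-safe, so threads can push concurrently

  workers = usernames.map do |username|
    Thread.new(username) { |name| queue << yield(name) }
  end
  workers.each(&:join)

  # All threads are done, so draining the queue here is safe.
  rows = []
  rows << queue.pop until queue.empty?
  rows
end

# With Sequel, the single write could then be something like
# (table name is an assumption):
#   DB[:streaks].multi_insert(rows)
```

Since only one thread would be writing, this might also ease the pressure on the connection pool, though I haven't measured that. It doesn't answer the question about updating existing rows.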
Measuring Thread Performance
I made a quick little video just to see what the difference in speed is now that I’m using threads to parse the profile pages:
In roughly 60 seconds without threads I was able to parse 93 pages, and with threads I was able to parse 313 pages. So, by using threads I was able to increase my throughput by roughly 3.4x, which is huge considering I need to parse 4,000,000+ profile pages.
Currently I’ve stored the longest streak for 105,184 users, which is a huge increase from 30,000. However, it appears to be locked up. My script is still running and it hasn’t errored out, but it’s not downloading any more users, so I’m not sure what is going on there.
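To put the throughput numbers in perspective, here is the back-of-the-envelope arithmetic. It assumes the rate measured over one minute holds steady for the whole run, which is optimistic. `days_to_parse` is just a helper made up for this calculation.

```ruby
pages_without_threads = 93.0 # pages parsed in roughly 60 seconds
pages_with_threads    = 313.0
total_pages           = 4_000_000

speedup = pages_with_threads / pages_without_threads

# Days needed to get through `total` pages at a steady pages-per-minute rate.
def days_to_parse(total, pages_per_minute)
  total / pages_per_minute / 60 / 24
end

puts format("speedup: %.2fx", speedup)
puts format("without threads: %.1f days", days_to_parse(total_pages, pages_without_threads))
puts format("with threads: %.1f days", days_to_parse(total_pages, pages_with_threads))
```

That works out to a speedup of about 3.37x, and to roughly 30 days of nonstop parsing without threads versus about 9 days with them.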