Gathering Data
November 29, 2013
Yesterday, before I had to put the turkey in the oven, I was finally able to start using the GitHub API to download a list of all the users and parse each profile page for their longest streak data.
The process isn’t very fast, and I currently have a bug in my program: sometimes a username doesn’t match a profile page, or maybe the user has been deleted, so there isn’t a profile page any more. Either way, my program crashes every time it encounters one of these 404 errors, so I need to handle them so that I don’t have to keep restarting my script.
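Here is a rough sketch of how the 404 handling might look. It is not my actual script: the names `ProfileNotFound`, `fetch_profile`, and `collect_profiles` are made up for illustration, and the profile URL pattern is an assumption. The idea is just to turn a 404 into an exception that the loop can rescue, so that one missing profile doesn't stop the whole run.

```ruby
require 'net/http'
require 'uri'

# Hypothetical error class used to signal a missing profile page.
class ProfileNotFound < StandardError; end

# Hypothetical helper: downloads a profile page and raises
# ProfileNotFound when the server answers with a 404.
# The URL pattern is an assumption, not taken from my real script.
def fetch_profile(username)
  response = Net::HTTP.get_response(URI("https://github.com/#{username}"))
  raise ProfileNotFound, username if response.code == '404'
  response.body
end

# Fetches every profile, skipping the ones that are missing instead of
# crashing. The fetcher is passed in so it can be swapped out in tests.
def collect_profiles(usernames, fetcher = method(:fetch_profile))
  pages = {}
  skipped = []
  usernames.each do |name|
    begin
      pages[name] = fetcher.call(name)
    rescue ProfileNotFound
      skipped << name # remember the username and keep going
    end
  end
  [pages, skipped]
end
```

Keeping the list of skipped usernames means I could look at them afterwards and figure out whether they are deleted users or a real bug.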
I really wish I had an answer by now for who has the longest streak on GitHub, but I currently only have data for about 35,000 users, and there are over 4 million users.
In the latest episode of the Ruby Rogues podcast they talk about threading with Emily Stolfo, so I think that is something I need to figure out how to do. Currently I get a chunk of about 100 users from GitHub, but then I have to parse the profile page of each user, which takes the most time. From what little I know about threading, I should be able to parse all of the users’ profile pages in separate threads, which should speed up my app significantly.
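As a first guess at what the threaded version could look like, here is a minimal sketch of a small worker pool built on Ruby's `Thread` and `Queue`. The `parallel_map` name and the default of 10 threads are my own assumptions, and the `sleep` call only stands in for the slow profile-page request.

```ruby
# Hypothetical helper: runs the given block once per item on a small pool
# of threads and returns the results in the same order as the input.
def parallel_map(items, thread_count: 10)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  results = Array.new(items.size)

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        begin
          # Non-blocking pop raises ThreadError once the queue is empty.
          item, index = queue.pop(true)
        rescue ThreadError
          break
        end
        results[index] = yield(item) # each thread writes to its own slot
      end
    end
  end

  workers.each(&:join) # wait for every worker to finish
  results
end

# Usage: the sleep stands in for downloading and parsing one profile page.
lengths = parallel_map(%w[alice bob carol], thread_count: 3) do |name|
  sleep 0.1
  name.length
end
```

Since most of the time is spent waiting on the network, threads should help even on standard Ruby, which can switch to another thread while one is blocked on I/O. A fixed pool also keeps me from opening 100 connections at once for every chunk of users.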