Last month I attended Hack Manchester – a 24-hour coding event held at MOSI as part of the Manchester Science Festival. I had only arranged to team up with Mike, but we ended up joining two guys Shaf introduced us to: his colleagues from the BBC, Jack and Tom. The four of us formed a team, and after browsing the challenges set, we liked the idea of Intechnica’s Bacon Number problem the most. Rather than just solve the Bacon Number problem, though, we came up with our own variation on the challenge and set off to build a tool to find the film set with the largest birthday party (the most common birthday per film, among actors in the same film).
We decided the data provided was too poorly formatted, and because any alternatives (such as the Open Movie Database) required sign-up and prior approval, we ended up scraping IMDB for the actor birthday data. I wrote a Python script using Beautiful Soup, which worked really well: it hit the page for every day of the year on IMDB and stored each actor’s name and birthday in a MongoDB collection in the following format:
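As a rough sketch of the shape of one record (this is not the original snippet): the `MM-DD` date string matches the ’01-01′ value mentioned in this post, but the field names `name` and `birthday` and the helper `make_actor_record` are assumptions made purely for illustration.

```python
def make_actor_record(name, month, day):
    """Build one actor document.

    The field names here are assumed for illustration; only the
    zero-padded date string (e.g. '01-01') comes from the post.
    """
    return {
        "name": name,
        "birthday": f"{month:02d}-{day:02d}",
    }


# A placeholder actor born on 1st January.
record = make_actor_record("Example Actor", 1, 1)
```

A dictionary like this is what would be handed to the MongoDB driver for insertion, one per actor listed on a day page.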
I ran the script for the 1st Jan page to see that it worked – and I had a number of records stored, all with ’01-01′ as the date, so it seemed to be working OK. I wrapped a loop around it to hit every day of every month and started running the script, with no idea how long it would take. I quit it quite early and added a print statement on every new month as an indication of where it was up to. Just how I used to solve most of the number-crunching Project Euler problems! I watched it run, and it seemed to take about a minute and a half per month, so it took about 20 minutes to run in total. It also crashed out at one point when it got a 500 error from IMDB – I deleted everything for May (which was incomplete) from the collection and ran it again from there, so as not to get duplicates or miss any out! Also, I should point out that we were having to run off tethering from my phone, because the Hack Manchester wifi only had 100 IPs to dish out (not ideal for a hackathon with 250 geeks with ~4 devices each!) – a real shame, as I’m sure the organisers did all they could to make the providers aware that they would need a lot of connected devices. Quite a lot of data ran through my phone that night – hundreds of hits on IMDB, plus various downloads (such as Beautiful Soup, the Mongo libraries, the IMDB text file data, etc.).
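The loop, with its once-a-month progress print and the ability to restart from a given month after a crash, might be sketched like this. It is a reconstruction rather than the script from the night: `scrape_day` is a hypothetical stand-in for "fetch one day page and store its actors", and the names `birthday_keys` and `run` are invented for the example.

```python
import calendar

LEAP_YEAR = 2012  # any leap year works; it ensures 29 February is included


def birthday_keys(start_month=1):
    """Yield (month, 'MM-DD') for every day from start_month to December."""
    for month in range(start_month, 13):
        days_in_month = calendar.monthrange(LEAP_YEAR, month)[1]
        for day in range(1, days_in_month + 1):
            yield month, f"{month:02d}-{day:02d}"


def run(scrape_day, start_month=1):
    """Call scrape_day(key) for each day, printing once per new month."""
    current_month = None
    for month, key in birthday_keys(start_month):
        if month != current_month:
            # One print per month: enough to see progress, cheap enough
            # not to slow the run down.
            print(f"Starting {calendar.month_name[month]}")
            current_month = month
        scrape_day(key)


# Resuming from May after a failure, with a list standing in for the scraper.
seen = []
run(seen.append, start_month=5)
```

Restarting at a month boundary only avoids duplicates if the partial month has been deleted from the collection first, which is what I did by hand.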
I sanity-checked this data by looking at the number of records held for each date in the collection. I noticed that 1st January had significantly more than all the other dates. I assumed I had left the data in from when I initially ran the script on 1st Jan to test it – although it was more than 2, even 3 (actually about 10) times the count for any other day. I deleted 1st Jan and ran the script on that day again, and got the same number. I looked at the IMDB page for 1st Jan and there were genuinely a lot more actors listed than for any other day. I asked around my team mates for ideas – someone suggested that people aim for 1st January as a birth date, but I said the excess wasn’t spread across nearby dates, and that didn’t really make sense anyway. Of course (you’ve probably already deduced this – please excuse us – we were tired), 1st January would have been the default value if no date was entered, or maybe the list included actors without a birth date given.
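That kind of sanity check is easy to automate. Here is a minimal sketch, assuming records shaped like dictionaries with a `birthday` field (an assumed name) and using "more than three times the median count" as an arbitrary threshold for a suspicious date; on the night I just eyeballed the numbers.

```python
from collections import Counter
from statistics import median


def counts_by_date(records):
    """Count how many records share each birthday string."""
    return Counter(r["birthday"] for r in records)


def suspicious_dates(records, factor=3):
    """Return dates whose count exceeds factor times the median count."""
    counts = counts_by_date(records)
    typical = median(counts.values())
    return {date: n for date, n in counts.items() if n > factor * typical}


# Made-up data: ten records on 1st January, one on each of three other days.
records = [{"birthday": "01-01"}] * 10 + [
    {"birthday": "01-02"},
    {"birthday": "01-03"},
    {"birthday": "01-04"},
]
outliers = suspicious_dates(records)
```

The median is used rather than the mean so that one huge date does not drag the "typical" count up and hide itself.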
I committed this code at around 1:45am, and about 45 minutes later, while browsing the team’s work on GitHub, I noticed the commit times for some files. The times, given in a friendly, relative, human-readable format, were:
What’s that? I committed the file … in 16 minutes? As in, in the future? How is that so? Well of course, Hack Manchester happened overnight on the one night of the year when daylight saving time ends and we move from BST to GMT. This happens at 2am, when the clocks go back to 1am, so every ‘time’ between 01:00 and 01:59 happened twice. I thought this was rather amusing :)
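The arithmetic can be reproduced with fixed UTC offsets. This is only an illustration of the effect: I don't know how the page actually worked out its relative times, the date used is the 2012 UK clock change (28 October), and the commit time is my approximate 1:45am, so the result comes out at 15 minutes rather than the 16 I saw.

```python
from datetime import datetime, timedelta, timezone

BST = timezone(timedelta(hours=1), "BST")  # before the clocks go back
GMT = timezone(timedelta(0), "GMT")        # after the clocks go back

# Committed at about 01:45 BST, looked at 45 real minutes later.
committed = datetime(2012, 10, 28, 1, 45, tzinfo=BST)
viewed = (committed + timedelta(minutes=45)).astimezone(GMT)  # 01:30 GMT


def wall_clock_age(now, then):
    """Age computed from the clock faces alone, ignoring UTC offsets."""
    return now.replace(tzinfo=None) - then.replace(tzinfo=None)


true_age = viewed - committed                 # offset-aware: 45 minutes ago
naive_age = wall_clock_age(viewed, committed)  # clock faces: 15 minutes ahead
```

Comparing offset-aware timestamps gives the real elapsed time; dropping the offsets makes 01:45 look later than 01:30, which puts the commit in the future.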
We then had a searchable database of actors and their birthdays. Jack whipped up a Twitter Bootstrap web interface, into which I added some PHP code (using the PHP MongoDB library) to display a list of actors with a given birthday, or show a given actor’s birthday. At this point we had no movies stored, so we had limited functionality.
Meanwhile, Mike had been writing a bunch of PHP classes containing methods for looking up the data. He’d also started writing a Ruby script to extract film-actor data from some text files he found somewhere. He’d had real trouble extracting the data in a form that would be useful to us. It was tab-separated, referenced films by random alphanumeric IDs rather than film names, and also contained a ridiculous number of porno films. Later on, Jack and I adapted this code to try to get it to insert the data into our existing MongoDB collection. It was quite fiddly, and we weren’t really sure how accurately the data was being collated, but it was worth a try!
At this point we had a discussion about how we would store the data. Someone suggested:
We need another table to store the films, and another to store the film-actor relations
Erm, that’s not how Mongo works. I’m no expert, and my solution may not have been the best, or Mongo-est, but I know you can store lists as values (making multiple ‘tables’, or collections or whatever, unnecessary), so I suggested we would be fine adding a ‘movies’ field to the existing actor storage, which would be a list of films they’d been in, e.g.:
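Something along these lines, written as a Python dictionary. The ‘movies’ field is the one described above; the `name` and `birthday` field names, the placeholder values and the `cast_of` helper are assumptions for illustration.

```python
# One actor document with the films embedded as a list, so no separate
# films table or film-actor relation table is needed.
actor = {
    "name": "Example Actor",
    "birthday": "01-01",
    "movies": ["Example Film", "Another Example Film"],
}


def cast_of(actors, title):
    """Return the names of all actors whose 'movies' list contains title."""
    return [a["name"] for a in actors if title in a.get("movies", [])]


# A second actor with no 'movies' field yet is simply skipped.
cast = cast_of([actor, {"name": "No Films Yet", "birthday": "02-02"}],
               "Example Film")
```

In MongoDB itself, a query matching a value against a list field finds documents whose list contains that value, so the same lookup works without a join.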
We managed to figure out how to add the movie field to an actor, and how to append a movie to a list already containing one, and we wrote this into the script and let it run. We left in a print statement to see what was happening, which obviously slowed the process down a lot. Think about how many movies there are, and think about how many actors there are. Now think about how many instances there are of an actor being in a movie. That’s a lot. It took bloody ages. And didn’t seem to work. We were out of time by this point (in fact time was almost up when the program started running). Doing this properly, we’d have tested it better and ensured all data was being entered correctly. We were just having a bash at getting it to work.
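For what it’s worth, here is how the two remaining steps could look on plain in-memory data: appending a film to an actor’s list, and then finding the film with the largest birthday party. This is a sketch of the idea and not the code we ran; the function names and sample data are invented, and in MongoDB the append would be done with an update operator such as `$push` or `$addToSet` instead.

```python
from collections import Counter, defaultdict


def add_movie(actors, name, title):
    """Append title to the actor's 'movies' list, creating it if missing."""
    movies = actors[name].setdefault("movies", [])
    if title not in movies:  # avoid duplicates if the import is re-run
        movies.append(title)


def biggest_birthday_party(actors):
    """Return (film, birthday, count) for the most-shared birthday in a film."""
    per_film = defaultdict(Counter)
    for actor in actors.values():
        for title in actor.get("movies", []):
            per_film[title][actor["birthday"]] += 1
    best = None
    for title, birthdays in per_film.items():
        birthday, count = birthdays.most_common(1)[0]
        if best is None or count > best[2]:
            best = (title, birthday, count)
    return best


# Made-up actors keyed by name, with made-up film credits.
actors = {
    "A": {"birthday": "03-14"},
    "B": {"birthday": "03-14"},
    "C": {"birthday": "07-01"},
}
for name, title in [("A", "Film X"), ("B", "Film X"), ("C", "Film X"),
                    ("C", "Film Y"), ("A", "Film X")]:
    add_movie(actors, name, title)

party = biggest_birthday_party(actors)
```

Given the 1st January oddity in the scraped data, a real run would probably want to leave that date out of the count, or at least treat any winner on that date with suspicion.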
All of us completely exhausted, we awaited the event closing and awards ceremony. Mike and I had stayed in the museum all night – each attempting a short nap on a couple of occasions, rather unsuccessfully, in a room full of geeks bashing away at their respective keyboards. Tom had a prior engagement, so he shot off in the early evening, and Jack headed home later on due to the problems with the wifi, and worked from home on setting us up an Amazon instance to host the project.
Among the hackers were many friends of mine – including a team consisting of Michael Heap and Tim Hastings; an MMU team with Farkie; a Manchester Girl Geeks team; a couple of Laterooms teams including Mark/Kirsty, Jim & Andy; a Thoughtworks team with Daley, and so on. I had plenty of people to chat to while taking breaks (I drank a lot of coffee) – and met a bunch of new people too.
It came to the closing, and the winners of each category were named and had a chance to give a short demo of their project. Some amazing stuff went on show – it was great to see so much innovation from so many teams. By chance, no-one else had chosen the Bacon Number challenge, so we won by default! A bit lame, I know, but the way I see it is that we weren’t so awful that they decided to withdraw the prize! I count that as a win. And what was the prize? A brand new 512MB Raspberry Pi each! Can’t complain! Huge thanks to Intechnica for the prizes :)
Also a great big thanks to Gemma and Sean for putting the event on. It was fantastic! I will definitely enter events like this in future, even without a team – you can always group up with people and get something done. I was worried about working with people who used different languages or frameworks and that we wouldn’t be able to get things done, but we pooled ideas and skills together and managed to build some cool stuff! Also thanks to MOSI for the use of the space (all through the night!) during the science festival.
The code from our hack is available on GitHub – it may or may not get updated/fixed in future, but at the time of writing it is as we left it at the end of the event.
Also check out Farkie’s blog post on the Magma Digital blog – Hack Manchester 2012