Last month I attended Hack Manchester – a 24-hour coding event held at MOSI as part of the Manchester Science Festival. Having only arranged to team up with Mike, we ended up joining two guys Shaf introduced us to, his colleagues from the BBC, by the names of Jack and Tom. The four of us formed a team, and after browsing the challenges set, we liked the idea of Intechnica’s Bacon Number problem the most. Rather than just solve the Bacon Number problem, though, we adapted the challenge and set off to build a tool to find the film set with the largest birthday party (the most common birthday per film, among actors in the same film).
We decided the data provided was too poorly formatted, and because any alternatives (such as the Open Movie Database) required sign-up and prior approval, we ended up scraping IMDB for the actor birthday data. I wrote a Python script using Beautiful Soup which worked really well: it hit every day page on IMDB and stored each actor’s name in a MongoDB collection in the following format:
I ran the script for the 1st Jan page to see that it worked, and I had a number of records stored, all with ‘01-01’ as the date, so it seemed to be working OK. I wrapped a loop around it to hit every day of every month and started running the script, with no idea how long it would take. I quit it quite early and added a print statement on every new month to indicate where it was up to. Just how I used to solve most of the number-crunching Project Euler problems! I watched it run, and it seemed to take about a minute and a half per month, so it took about 20 mins to run in total. It also crashed out at one point when it got a 500 error from IMDB; I deleted everything for May (which was incomplete) from the collection and ran it again from there, so as not to get duplicates or miss any out. Also, I should point out that we were having to tether from my phone, because the Hack Manchester wifi only had 100 IPs to dish out (not ideal for a hackathon with 250 geeks with ~4 devices each!) – a real shame, as I’m sure the organisers did all they could to warn the providers that they would need a lot of connected devices. Quite a lot of data ran through my phone that night: hundreds of hits at IMDB, plus various downloads (Beautiful Soup, the Mongo libraries, the IMDB text file data, etc.).
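The loop over every day of every month, with a progress line per month and the option to restart from a given month after a crash, might look roughly like this. It is only a sketch: the function name `birthday_keys` is made up, the scraping itself is left out, and reading ‘01-01’ as month-then-day is an assumption.

```python
import calendar


def birthday_keys(start_month=1, year=2012):
    """Yield an 'MM-DD' key for every day from start_month to December.

    A leap year is used by default so that 29 February is included.
    """
    for month in range(start_month, 13):
        print(f"Starting month {month:02d}")  # one progress line per month
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            yield f"{month:02d}-{day:02d}"


# Full run: every day of the year.
all_keys = list(birthday_keys())

# Resuming after a crash part-way through May: clear May's records first,
# then restart from the beginning of that month.
resumed_keys = list(birthday_keys(start_month=5))
```

Printing once per month rather than once per page keeps the output readable while still showing that the script is alive.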
I sanity-checked this data by looking at the number of records held for each date in the collection. I noticed that 1 January had significantly more records than any other date. I assumed I had left the data in from when I initially ran the script on 1st Jan to test it – although it was not 2 or even 3 but about 10 times as many as any other day. I deleted Jan 1st and ran it on that day again, and got the same number. I looked at the IMDB page for 1st Jan and there were genuinely a lot more than for any other day. I asked around my teammates for an idea – someone suggested that people aim for 1st January as a birth date, but I said the excess wasn’t spread across nearby dates, and that didn’t really make sense anyway. Of course (you probably already deduced this – please excuse us – we were tired), it was that 1st January would have been the default value if no date was entered, or maybe this list included actors without a birth date given.
I committed this code at around 1:45am, and about 45 minutes later, while browsing the team’s work on GitHub, I noticed the commit times for some files. The times, given in a friendly, time-relative, human-readable way, were:
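A sanity check of this kind can be automated by counting records per date and flagging anything far above the typical count. The sketch below uses made-up in-memory records rather than a real MongoDB collection, and the field name `birthday` and the threshold `factor` are assumptions.

```python
from collections import Counter
from statistics import median


def suspicious_dates(records, factor=3):
    """Return {date: count} for dates with more than factor x the median count."""
    counts = Counter(record["birthday"] for record in records)
    typical = median(counts.values())
    return {date: n for date, n in counts.items() if n > factor * typical}


# Made-up data: ten records on 01-01 and one on each of three other days.
records = (
    [{"name": f"Actor {i}", "birthday": "01-01"} for i in range(10)]
    + [{"name": "Actor A", "birthday": "01-02"}]
    + [{"name": "Actor B", "birthday": "01-03"}]
    + [{"name": "Actor C", "birthday": "01-04"}]
)

flagged = suspicious_dates(records)
```

The median is used rather than the mean so that the outlier itself does not drag the baseline up.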
What’s that? I committed the file … in 16 minutes? As in, in the future? How is that so? Well of course, Hack Manchester happened overnight on the day in the year when daylight saving reverts back and we move from BST to GMT, and this happens at 2am, when it goes back to 1am. So every ‘time’ between 01:00 and 01:59 happened twice. I thought this was rather amusing :)
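The arithmetic behind that “in 16 minutes” can be reproduced with fixed UTC offsets. This is just an illustration: the date and the exact viewing time are assumptions, and it says nothing about how GitHub actually computes its relative times.

```python
from datetime import datetime, timedelta, timezone

BST = timezone(timedelta(hours=1), "BST")  # British Summer Time, UTC+1
GMT = timezone.utc                         # GMT treated as UTC+0

# Committed at 01:45 while still on BST; viewed at 01:29 after the clocks
# went back. The date is an assumption made for the example.
commit_time = datetime(2012, 10, 28, 1, 45, tzinfo=BST)
viewed_at = datetime(2012, 10, 28, 1, 29, tzinfo=GMT)

# Comparing only the wall-clock readings puts the commit in the future.
naive_gap = viewed_at.replace(tzinfo=None) - commit_time.replace(tzinfo=None)

# Comparing the actual instants gives the real elapsed time.
real_gap = viewed_at - commit_time
```

Here `naive_gap` is minus 16 minutes while `real_gap` is 44 minutes, which matches the “about 45 minutes later” in the story.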
We then had a searchable database of actors and their birthdays. Jack whipped up a Twitter Bootstrap web interface, into which I added some PHP code (using the PHP MongoDB library) to display a list of actors with a given birthday, or show a given actor’s birthday. At this point we had no movies stored, so we had limited functionality.
Meanwhile, Mike had been writing a bunch of PHP classes containing methods for looking up the data. He’d also started writing a Ruby script to extract film-actor data from some text files he found somewhere. He’d had real trouble extracting the data in a form that would be useful to us. It was tab-separated, referenced films by seemingly random alphanumeric IDs rather than film names, and also contained a ridiculous number of porno films. Later on, Jack and I adapted this code to try to get it to insert the data into our existing MongoDB collection. It was quite fiddly, and we weren’t really sure how accurately the data was being collated, but it was worth a try!
At this point we had a discussion about how we would store the data. Someone suggested:
We need another table to store the films, and another to store the film-actor relations
Erm, that’s not how Mongo works. I’m no expert, and my solution may not have been the best, or Mongo-est, but I know you can store lists as values (making multiple ‘tables’, or collections or whatever, unnecessary), so I suggested we would be fine to add a ‘movies’ field to the existing actor records, which would be a list of films they’d been in, e.g.:
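As a rough picture of that idea, here is an actor record with an embedded list of films, written as a plain Python dict. Apart from ‘movies’, the field names are assumptions, and the values are placeholders rather than real data.

```python
# Hypothetical actor document: field names and values are illustrative only.
actor = {
    "name": "Example Actor",
    "birthday": "01-01",
    "movies": ["Film A", "Film B"],  # the list lives inside the actor record
}


def actors_in(film, actors):
    """Return the names of all actors whose 'movies' list contains film."""
    return [a["name"] for a in actors if film in a.get("movies", [])]


# An actor with no 'movies' field yet is simply treated as having no films.
cast = actors_in("Film A", [actor, {"name": "Other Actor", "birthday": "02-03"}])
```

With the films embedded like this, no separate film table or film-actor relation table is needed to answer “who was in this film?”.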
We managed to figure out how to add the movie field to an actor, and how to append a movie to a list already containing one, and we wrote this into the script and let it run. We left in a print statement to see what was happening, which obviously slowed the process down a lot. Think about how many movies there are, and think about how many actors there are. Now think about how many instances there are of an actor being in a movie. That’s a lot. It took bloody ages. And didn’t seem to work. We were out of time by this point (in fact time was almost up when the program started running). Doing this properly, we’d have tested it better and ensured all data was being entered correctly. We were just having a bash at getting it to work.
All of us completely exhausted, we awaited the event closing and awards ceremony. Mike and I had stayed in the museum all night – each attempting a short nap on a couple of occasions, rather unsuccessfully in a room full of geeks bashing away at their respective keyboards. Tom had a prior engagement, so he shot off in the early evening, and Jack headed home later on due to problems with the wifi, and worked from there on setting up an Amazon instance to host the project.
It came to the closing, and the winners of each category were named and given a chance to give a short demo of their project. Some amazing stuff went on show – it was great to see so much innovation from so many teams. By chance, no-one else had chosen the Bacon Number challenge, so we won by default! A bit lame, I know, but the way I see it is that we weren’t so awful that they decided to withdraw the prize! I count that as a win. And what was the prize? A brand new 512MB Raspberry Pi each! Can’t complain! Huge thanks to Intechnica for the prizes :)
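One way to express “add the field if it is missing, otherwise append to the list” is MongoDB’s `$push` update operator, which does both. This is not necessarily what our script did; the helper name `add_movie_update` and the `name` field are assumptions, and the sketch only builds the update documents rather than talking to a database.

```python
def add_movie_update(actor_name, film_title):
    """Build the (query, update) pair that appends one film to one actor.

    $push creates the 'movies' array if the field is absent and appends
    to it otherwise, so both cases are covered by the same update.
    """
    query = {"name": actor_name}
    update = {"$push": {"movies": film_title}}
    return query, update


query, update = add_movie_update("Example Actor", "Film A")

# With a pymongo collection (not imported here) the pair would be passed as:
#   collection.update_one(query, update)
```

Swapping `$push` for `$addToSet` would avoid duplicate films if the import had to be re-run.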
Also a great big thanks to Gemma and Sean for putting the event on. It was fantastic! I will definitely enter events like this in future, even without a team – you can always group up with people and get something done. I was worried about working with people who used different languages or frameworks and that we wouldn’t be able to get things done, but we pooled ideas and skills together and managed to build some cool stuff! Also thanks to MOSI for the use of the space (all through the night!) during the science festival.
The code from our hack is available on GitHub – it may or may not get updated/fixed in future, but at the time of writing it is as we left it at the end of the event.
This weekend I attended the fifth (my third) PHPNW annual conference. As a member of the local PHPNW user group and community, I volunteer as a helper, which involves getting delegates registered, getting the speakers to the right place and making sure everything’s running smoothly. Starting at the Friday evening hackathon social, I got chatting with a few faces old and new, and once I’d eaten, got coding with Mike – we did the Ordered Jobs Kata in PHP, pairing and using PHPUnit. We continued with this, along with getting into conversations with other delegates, until around midnight – then Mike gave me a kick-start demo of Phing, the PHP build tool, which I’ve used before but never written build scripts for, so that was a really useful session – well into the morning!
Arriving at the conference centre bright and early, donning our new PHPNW12 red helper t-shirts, we got people registered and handed name badges out. Once we had everyone settled in, the event kicked off with a truly inspiring talk from Google’s Ade Oshineye on relating API design to real-world usability (like doors that are hard to work out how to open). I then shadowed Patrick Allaert for his talk on PHP Data Structures – I learned an awful lot about the different data structures available in SPL – I had no idea they were even there. I picked up some other useful tricks and tips too.
I chatted with Ben Waine over lunch and headed over to Michael Heap’s talk on designing systems to scale, though the room was full so I ended up hanging out in the “corridor track” with the Magma crew. In the next session I opted for the unconference track for Ben’s talk on ‘Testing Your Shit with Behat’ – cool to see Behat in action! Next up I was shadowing for Google employee Ian Barber, who gave last year’s keynote (before he worked for Google) entitled ‘How to Stand on the Shoulders of Giants’, which is well worth a watch. This year’s talk was ‘How to Build a Firehose’, a really interesting look at how to deal with exposing live data streams in real time.
Following the usual wrap-up of the day, the social happened.
Somehow I managed to get up in the morning and head back to the conference centre. The first talk was on Responsive Design at the BBC – a fantastic and intriguing talk from John Cleveley. The second talk slot had two talks I really wanted to see – Adrian’s talk on using nginx on the Raspberry Pi, and ‘To SQL or To No(t)SQL’ by Joroen Van Dijk – but I was down to shadow the other track, Recognising Smelly Code, which I really enjoyed. It had some really good points, and the speaker shared my adoration of good naming conventions and the single responsibility principle.
Another great conference – it gets better every year, without ever having been bad. Thanks to Magma for organising, and to all the sponsors and delegates for making it an awesome event. It really reminded me how great the PHP community is.
I will close with a statement expressing my opinion of Drupal:
There are some great additions in PHP 5.4, the highlights (other than a huge increase in speed, apparently) being square bracket notation for arrays, array dereferencing and the ability to use traits.
I’m quite excited (sadly) about the use of square brackets to initialise an array, and to be able to code up their contents in this way:
This is similar to the way we do lists in Python.
Array dereferencing means we can now access a particular element of an array returned by a function call directly, without first assigning the array to a variable:
This works the same way with objects:
There’s now something called Traits, a concept brought into PHP in version 5.4:
This allows a trait to be reused by any objects which refer to it in this way. It’s to save on copy and pasting blocks of code. The compiler now does that for us!
Also, up to and including PHP 5.3 we could attempt to echo an array without a notice given, but the word ‘Array’ would be echoed instead of any of the array’s content. Now in PHP 5.4 a notice is given:
Although it does still echo the word ‘Array’.
Here’s a great video (the keynote at PHPUK Conference in London) of Rasmus talking about the PHP project from the beginning, and about PHP 5.4:
The ternary operator is a shorthand way of writing an if/else statement where a particular action occurs in both cases, but the value associated with that action depends on the condition stated.
For example, the traditional if/else construct (C/Java/JavaScript syntax):
can be rewritten as:
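Python spells the same idea slightly differently from the C-style `condition ? a : b`, putting the condition in the middle: `a if condition else b`. As a sketch, with made-up function names, the long and short forms look like this:

```python
def describe_long(x):
    """Traditional if/else: the same assignment appears in both branches."""
    if x > 0:
        label = "positive"
    else:
        label = "not positive"
    return label


def describe_short(x):
    """Conditional expression: one action, value chosen by the condition."""
    return "positive" if x > 0 else "not positive"
```

Both functions return the same string for any number; only the amount of code differs.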
This in itself is a huge benefit to clean, concise code. I use it wherever possible. Here’s an example in PHP:
A particularly cool Python example utilising the idea of a function of a comprehended list:
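Something along these lines, for instance: a conditional expression inside a list comprehension, with a function applied to the resulting list. The data and variable names are invented for the illustration.

```python
values = [3, -1, 4, -1, 5]  # illustrative data

# Each element contributes itself if positive, otherwise 0; sum() is the
# function applied to the comprehended list.
total_of_positives = sum([x if x > 0 else 0 for x in values])
```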
If you want to return/echo true or false depending on the condition, there is no need for the ternary operator as a shorter operator is available: simply echo the boolean result of the condition, i.e. rather than:
This will produce the same output:
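In Python, for instance, the two forms might look like this (the variable names are made up):

```python
x = 5

# Ternary that only ever yields True or False: redundant.
verbose = True if x > 0 else False

# The comparison already evaluates to a boolean.
direct = x > 0
```

Both `verbose` and `direct` hold the same boolean, so the conditional expression adds nothing here.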
There are various other implementations of this idea in different languages, but the reason for this blog post is that, while talking about these on the train to work the other day, my colleague Mike and I came up with an interesting manipulation of it. I had a program which incremented a value by 1 if and only if a condition was true:
In my opinion this is good because it’s on one line, but bad because the else 0 should be unnecessary. Unfortunately Python requires an else here. The obvious alternative doesn’t use a +0 but requires 2 lines:
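The two versions being compared can be sketched as follows, using a made-up counting loop where the condition is `x > 0`:

```python
def count_one_line(values):
    """Increment on one line; Python insists on the 'else 0' part."""
    count = 0
    for x in values:
        count += 1 if x > 0 else 0
    return count


def count_two_lines(values):
    """No '+0' needed, but the increment takes two lines."""
    count = 0
    for x in values:
        if x > 0:
            count += 1
    return count
```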
Anyway, the thing we thought of was to increment by the integer value of the boolean, i.e. 1 if True, 0 if False:
Evaluating a condition, say x>0, returns True or False, which, when added to an integer, counts as 1 or 0 respectively. Another implementation of this is to multiply the value of the condition by a scaling factor:
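Both tricks rely on Python’s `bool` being a subclass of `int`, so `True` behaves as 1 and `False` as 0 in arithmetic. A small sketch, with hypothetical function names and an arbitrary bonus of 10:

```python
def count_positives(values):
    """Add the boolean itself: True counts as 1, False as 0."""
    count = 0
    for x in values:
        count += (x > 0)
    return count


def add_bonus(score, passed, bonus=10):
    """Scale the condition: adds bonus if passed is True, nothing otherwise."""
    return score + bonus * passed
```

For example, `count_positives([3, -1, 4])` gives 2, and `add_bonus(50, True)` gives 60 while `add_bonus(50, False)` leaves the score at 50. It is compact, though arguably less obvious to a reader than the explicit `if`.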