Media publishing workflows using NServiceBus
About this video
See how Spotlight used NServiceBus to decouple legacy systems from their hosting environments.
Transcription
- 00:00 Dylan Beattie
- I work for a company called Spotlight, and today I'm going to talk to you briefly about the ways in which Spotlight uses NServiceBus in our software infrastructure. Spotlight is a casting directory. We're basically like a Yellow Pages for actors, actresses and television presenters. Almost all of the casting that goes on in this country goes through our system. We're doing the new Star Wars movie, we've done Game of Thrones, we did most of the Pirates of the Caribbean films.
- 00:30 Dylan Beattie
- Thank you. We've been running since 1927. We first went to CD-ROM in the early '90s, and we were online in '97. I've been working with them since 2000. Generally when people ask what I do, I tell them that I run a website that helps actors get jobs, which is a pretty succinct way of summing it up. Behind the scenes at Spotlight, historically, all our core data is in Microsoft SQL Server, which obviously takes you down a certain path when it comes to choosing your technology stack.
- 01:07 Dylan Beattie
- We've got bits of Classic ASP still knocking around. We've got lots of various systems. We have, I think, a pretty typical example of trying to take a big ball of mud that makes a lot of money and decouple it into lots of smaller, more manageable, isolated systems, without jeopardizing the revenue along the way. We have a pretty open attitude to open source .NET, and what they used to call the ALT.NET stack, which is how I first got involved with NServiceBus and the idea of distributed systems and message queuing.
- 01:42 Dylan Beattie
- So here's the first scenario where we actually looked at doing this and implemented a system for real. When Spotlight first started, we'd been a publishing company, a book publishing company, for a lot of years. And because we're a publishing company, a lot of our revenue stuff is based on publication. People used to pay to go in a book, and lots of those people have been members for 10, 15, 20 years, in some cases their entire lives. So a lot of our business processes are coupled to something as seemingly trivial as putting photographs on the web, even if it doesn't look like it. You look at pictures on the web and think: solved problem, easy. Except that because our photo publishing system is linked to a book publishing system, which is linked to a revenue system, we were actually quite restricted in what we were able to do with the core system that stores most of our photography data.
- 02:32 Dylan Beattie
- They got rid of using acid and bromide plates about 10 years before I started working there. When I got there, they still had people scanning hard-copy 10x8s, and a huge machine with loads of bromide chemicals and things in one of the downstairs offices. And the system evolved from there into a pulse script. We had this notion of a central file server, back when traffic was relatively light and the web wasn't as popular as it is these days. That approach worked. But the problem we were having is that we got to a point where our website was grinding. We did some analysis, and 70% of our CPU time, and a similar proportion of our network traffic, was because every time anyone asked for a photograph of an actor on the web, the servers that were serving it had to connect to our central file server and determine whether the photograph had changed.
- 03:22 Dylan Beattie
- Because of these restrictions of our business process I talked about, we couldn't do something sensible like just give a photograph one file name, and then if you have a different picture it has to have a different file name. That makes these problems go away; that's how you'd do it if you were starting from scratch. So we have this problem: we have a central file server which stores 100,000 to 150,000 high-res, high-quality, publication-ready images. We needed a way of delivering those as thumbnails on the web quickly, that didn't require lots of network traffic and lots of CPU time and processing on the individual clusters.
- 03:59 Dylan Beattie
- We hit this problem around the time that my team and I started looking at distributed systems and message queuing as a way of doing things. Now, one of the reasons why I was keen to talk about this: quite often, certainly when we were looking at it, you'd see examples of very big enterprise systems, like procurement systems, customer processing and order processing, where effectively message queuing is the backbone of the entire enterprise. And you'd see the small examples you'd get in tutorials and training sessions. This, to me, is an interesting example. Message queuing is not yet the backbone of what we do, but this is somewhere where we were able to take one part of the service we provide and improve it a lot by moving it onto a distributed system.
- 04:44 Dylan Beattie
- So the system we came up with is basically, "Okay, instead of asking the file server every time someone asks you for a photograph, why don't we get the file server to tell everything when a photograph changes?" Then all of the web servers, we've got about five of them running in a cluster, all of them will have enough information held locally that whenever someone asks for a photograph, they'll know whether the one they've currently got in their cache is good or not. It's very simple, and it solved the problem for us in terms of eliminating that server and network traffic we were concerned about.
- 05:17 Dylan Beattie
- So this is basically the architecture we'd come up with. We've got a file server, so the file server just sits here. FileSystemWatcher is a .NET component I'm sure most of you will have used at least once or twice in your illustrious careers. This is a notifier component, which I'll show you the code for in a second. It's basically an NServiceBus IWantToRunAtStartup, hosted in-process as a Windows service by NServiceBus, and it hooks into the FileSystemWatcher. When a file changes, it pushes a message out onto the bus that just says to all the web servers, "By the way, this photograph here has now changed. Next time someone asks for it, make sure you come back and get a fresh copy, because the one you've got is currently out of date." Simple. The only other thing that we did in here is that this notifier process will also continually cache that data into SQL Server.
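For illustration, a minimal sketch of how a notifier like that might hook FileSystemWatcher events and publish a change message onto the bus. The class and property names here are assumptions for the sake of the example, not Spotlight's actual code:

```csharp
using System.IO;
using NServiceBus;

// Hypothetical sketch: watch the central file server and publish a
// notification onto the bus whenever a photograph changes.
public class FileSystemMonitor
{
    readonly FileSystemWatcher watcher;

    public FileSystemMonitor(IBus bus, string pathToWatch)
    {
        watcher = new FileSystemWatcher(pathToWatch)
        {
            IncludeSubdirectories = true,
            NotifyFilter = NotifyFilters.FileName | NotifyFilters.LastWrite
        };

        watcher.Changed += (s, e) => bus.Publish(new FileChangedMessage
        {
            FullPath = e.FullPath,
            LastWriteTime = File.GetLastWriteTimeUtc(e.FullPath)
        });
        // Created and Deleted events would publish FileCreatedMessage and
        // FileDeletedMessage in the same way (message types sketched below).
    }

    public void Start() { watcher.EnableRaisingEvents = true; }
    public void Stop()  { watcher.EnableRaisingEvents = false; }
}
```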
- 06:11 Dylan Beattie
- The reason for this: if we need to take any of these systems down for maintenance, any changes that take place in the meantime won't necessarily be broadcast to the other systems. So as long as this notifier keeps running, file system changes will get pushed into the database. So if we need to restart IIS or spin up one of the web servers again, it can pull the latest state of our entire application from a database, instead of having to scan the entire file server to do it. So I'm just going to show you very quickly the code that puts this together, because it really is quite a simple solution. So the basic core components of it: the notifier thing here, the notifier service itself.
- 07:31 Dylan Beattie
- It's pretty straightforward: the notifier service implements IWantToRunAtStartup, which is the NServiceBus convention for something that's going to run as a service. MediaFileServer here is our abstraction over the file system. FileSystemMonitor is the bit that contains the FileSystemWatcher component. Then the notification publisher is the wrapper around the actual NServiceBus stuff. Very ... Sorry?
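In the NServiceBus 3 era, a component like that is typically started through the IWantToRunAtStartup convention mentioned above. A hedged sketch of that wiring, with illustrative names:

```csharp
using NServiceBus;

// Hypothetical sketch: the NServiceBus host calls Run() once the endpoint
// is up (when the Windows service starts) and Stop() on shutdown.
public class NotifierService : IWantToRunAtStartup
{
    readonly FileSystemMonitor monitor;

    // The monitor is resolved from the container the endpoint is configured with.
    public NotifierService(FileSystemMonitor monitor)
    {
        this.monitor = monitor;
    }

    public void Run()  { monitor.Start(); }
    public void Stop() { monitor.Stop(); }
}
```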
- 08:04 Speaker 2
- Zoom in?
- 08:04 Dylan Beattie
- Zoom in? Better?
- 08:04 Speaker 2
- More?
- 08:09 Dylan Beattie
- Better?
- 08:10 Speaker 2
- Yes.
- 08:13 Dylan Beattie
- So the message types themselves: we've got a FileCreatedMessage, FileChangedMessage, FileDeletedMessage, nothing particularly elaborate; an enumeration that says the types of changes that can occur on a file; and then bus.Publish. So when a change comes through, the notifier picks it up and pushes it onto the bus. And then there is a component running on all of the web servers, which is the PhotoWeb component. Big enough? And this thing here, basically, it's using Windsor, it's using NServiceBus, and it subscribes to the notification types.
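Those message contracts might look roughly like this; a sketch, with the exact properties being assumptions:

```csharp
using System;
using NServiceBus;

// Sketch of the notification messages and the change-type enumeration
// described above. Property names are illustrative.
public enum FileChangeType { Created, Changed, Deleted }

public class FileCreatedMessage : IMessage
{
    public string FullPath { get; set; }
    public DateTime LastWriteTime { get; set; }
}

public class FileChangedMessage : IMessage
{
    public string FullPath { get; set; }
    public DateTime LastWriteTime { get; set; }
}

public class FileDeletedMessage : IMessage
{
    public string FullPath { get; set; }
}
```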
- 09:21 Dylan Beattie
- So when a notification message comes in, the web server is subscribed to it; it picks it up and updates the cache. The two things that we store in the cache on the web server itself, held in memory so there isn't even a database lookup for these, are the file's LastWriteTime and the file's Checksum. We push both of those out in HTTP headers to the client whenever anyone downloads a photograph. So in most cases, when the browser says, "Hey, give me this photograph. By the way, I have a version of it from yesterday, it was modified at 12:00 on June the fifth, and it has the following Checksum," most of the time this system, without doing any database or file system lookups, can go, "Yeah, you're good, 304 Not Modified. Use the one you've already got, it's fresh."
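A hedged sketch of the subscribing handler on each web server, holding the two cached values in memory. The cache shape, and the idea of refreshing the checksum lazily, are assumptions:

```csharp
using System;
using System.Collections.Concurrent;
using NServiceBus;

// Illustrative in-memory cache entry: the two values sent to the browser
// as HTTP headers with every photograph.
public class CachedImageInfo
{
    public DateTime LastWriteTime { get; set; }
    public string Checksum { get; set; }
}

// Sketch of the handler running on every web server: when a change
// notification arrives, update the locally held metadata for that path.
public class FileChangedHandler : IHandleMessages<FileChangedMessage>
{
    // One shared map per web server: file path -> latest known metadata.
    public static readonly ConcurrentDictionary<string, CachedImageInfo> Cache =
        new ConcurrentDictionary<string, CachedImageInfo>();

    public void Handle(FileChangedMessage message)
    {
        // Record the new timestamp; the checksum is refreshed the next time
        // this server pulls a fresh copy of the image from the file server.
        Cache[message.FullPath] = new CachedImageInfo
        {
            LastWriteTime = message.LastWriteTime
        };
    }
}
```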
- 09:59 Dylan Beattie
- If anything changes, the web server gets a notification from the core infrastructure that this picture has now been modified. The client sends in the request and the web server goes, "Hang on a sec. Your copy is out of date. Let me just spin you up a fresh one." That goes into the local cache and gets sent down with a fresh set of headers. And that's basically how it works. It's not terribly complicated, but it did solve what looked to us like a pretty complicated problem.
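The conditional-GET behaviour described here maps onto the standard HTTP validators (Last-Modified/If-Modified-Since and ETag/If-None-Match). A rough sketch of that decision, with the ASP.NET plumbing simplified and the names assumed:

```csharp
using System;
using System.Web;

// Sketch: compare the validators the browser sent back against the locally
// cached metadata, and answer 304 Not Modified when they still match.
public static class ImageResponder
{
    public static void Respond(HttpContext context, CachedImageInfo cached)
    {
        string ifNoneMatch = context.Request.Headers["If-None-Match"];
        string ifModifiedSince = context.Request.Headers["If-Modified-Since"];

        DateTime clientDate;
        bool notModified =
            ifNoneMatch == cached.Checksum &&
            DateTime.TryParse(ifModifiedSince, out clientDate) &&
            clientDate >= cached.LastWriteTime.AddSeconds(-1); // allow for header precision

        if (notModified)
        {
            context.Response.StatusCode = 304;   // "use the one you've already got"
            return;
        }

        // Otherwise serve a fresh copy with fresh validators.
        context.Response.AddHeader("ETag", cached.Checksum);
        context.Response.AddHeader("Last-Modified", cached.LastWriteTime.ToString("R"));
        // ...write the image bytes from the local cache to the response...
    }
}
```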
- 10:26 Dylan Beattie
- Okay, so the second scenario where we're using NServiceBus and message queuing. This is a greenfield project we built about two years ago. Basically, it's uploading video and audio clips, transcoding them, and making them available for publication online. Right, what I'm going to do with this one is a little live demo. We're actually going to capture a piece of video and push it through the system. While it's processing through the backend, I'll talk you through how it works, and then we'll be able to go in and see it up and running on the other side.
- 10:59 Dylan Beattie
- So this is going to be the NSBCon 2014 official audience selfie video. If you guys all want to give a nice big wave to the camera. Cool. So I'm going to email that to myself. Okay. So while that's going through: architectural requirements for this. Obviously, video files are big and they're messy. There are who knows how many codecs, how many video formats, how many container formats, and they tend to be big. A lot of our audience's typical workflow: they'll go and get a copy of a DVD from someone, they'll rip the raw MPEG footage off the disk, because they found a thing on Google telling them how to do it, and then they'll want to upload that directly to us. So it's not unusual to see uploads of 1 to 1.5 gigabytes for a small video clip, because that's the only format someone's got it in. The people we work for are great actors, but they're not digital video experts. So we needed to build a system that could handle this sort of diversity and volume of data whilst remaining responsive.
- 12:37 Dylan Beattie
- There are obviously a lot of moving parts in the process. You've got to take the thing, analyze it, check it, extract the thumbnails, transcode it, and make it available for playback. And we wanted it to do all of this in a way that didn't grind to a halt every time someone put a one-gig video file into it. We were also very keen, at this point, that this shouldn't be coupled in any way to any of the existing stuff we had: minimal coupling. So we exposed a handful of entry points on an API into our legacy database. Membership status, basically: it just queries to see whether someone is logged in and whether they're allowed to upload video. And then everything else is handled within the system, fully decoupled.
- 13:48 Dylan Beattie
- So this is the front-end system, which we expose to the customers, and which lets them go and upload and handle their own video. So I'm going to grab that selfie video we just shot. Right, while that is flowing through the distributed message-passing backend system we've got, let me talk through the way the architecture works. Vanadium is the internal code name for this whole system. So the first thing we've done: the web server that shows you the pages is not the same web server that you submit the uploads to. That means that the front-end UI remains responsive even if people are uploading big quantities of video. So the Vanadium front-end is the bit I just showed you. The Vanadium.Ingester is the bit that's just received that video file. When it does, it's going to broadcast a FileReceivedEvent on a queue called the ingester queue. There's a component called the Wrangler, which acts as a broker in this architecture; basically, that's the bit that coordinates the various other systems.
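A hedged sketch of that hand-off: the Ingester publishes an event once it has stored the upload, and the Wrangler reacts by sending the next step along. The property names and the queue address are assumptions:

```csharp
using System;
using NServiceBus;

// Sketch: event broadcast by the Ingester once an uploaded file has been stored.
public class FileReceivedEvent : IMessage
{
    public Guid ClipId { get; set; }
    public string StoredPath { get; set; }
    public int MemberId { get; set; }
}

// Sketch: command the Wrangler sends on to the ClipChecker endpoint.
public class CheckClipCommand : IMessage
{
    public Guid ClipId { get; set; }
    public string StoredPath { get; set; }
}

// The Wrangler acts as the broker: on FileReceivedEvent it asks the
// ClipChecker to validate the upload and extract the thumbnails.
public class Wrangler : IHandleMessages<FileReceivedEvent>
{
    public IBus Bus { get; set; }   // injected by NServiceBus

    public void Handle(FileReceivedEvent message)
    {
        // "ClipChecker" stands in for whatever queue address the real system used.
        Bus.Send("ClipChecker", new CheckClipCommand
        {
            ClipId = message.ClipId,
            StoredPath = message.StoredPath
        });
    }
}
```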
- 15:11 Dylan Beattie
- So once that gets in, it's going to fire up the ClipChecker, which is a .NET wrapper around an open source video library called ffmpeg. This will do a first-pass validation. It'll extract the thumbnails that we're going to use for that video, and it'll give us the video length, so we can make sure it's not going to take the customer over their quota. At this point, we've got a Windows service that's wrapping a native executable; ffmpeg isn't .NET, it's not managed code. So obviously this is somewhere where we got a lot of benefit from decoupling, using a queuing architecture. This normally takes a couple of seconds. Sometimes if things are busy, or we're having congestion problems on that box, it can take a lot longer. But the point is the thing goes up, the ClipChecker does its job, and at some point, like I said, it's normally pretty quick, it'll either say it succeeded or failed.
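Since ffmpeg is a native executable, that wrapper boils down to shelling out and parsing what comes back. A hedged sketch of pulling the clip duration for the quota check; the arguments and parsing here are assumptions:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Sketch: run ffmpeg against the uploaded file, capture its info output,
// and read the clip duration out of it for the quota check.
public class ClipChecker
{
    public TimeSpan GetClipDuration(string videoPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "ffmpeg",
            Arguments = "-i \"" + videoPath + "\"",
            RedirectStandardError = true,   // ffmpeg writes its info banner to stderr
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            string output = process.StandardError.ReadToEnd();
            process.WaitForExit();

            // ffmpeg prints a line like "  Duration: 00:01:23.45, ..."
            var match = Regex.Match(output, @"Duration:\s*(\d+):(\d+):(\d+)");
            if (!match.Success)
                throw new InvalidOperationException("Could not read clip duration from ffmpeg output.");

            return new TimeSpan(
                int.Parse(match.Groups[1].Value),
                int.Parse(match.Groups[2].Value),
                int.Parse(match.Groups[3].Value));
        }
    }
}
```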
- 16:00 Dylan Beattie
- If it failed, we'll publish that to the UI, so the next time the customer logs in to see what's going on, they'll get a thing saying, "Sorry, this one didn't work." Normally what will happen is the clip check will have succeeded. At that point, the actual video transcoding and hosting is handled for us by a hosting partner; we don't do our own transcoding and we don't host the video ourselves. We had some integration issues with their API, because the notification events that we expected to receive from them don't actually tell you a video is playable. They just tell you that they've got the video. So you immediately try and play it back, and there's nothing there yet, because it hasn't been through their encoding pipeline yet.
- 16:39 Dylan Beattie
- So what we do: we have a dedicated uploader service. Once a clip is ready for publication, this thing will use FTP to send up the clip and an XML manifest describing the metadata for that clip: which member of ours it belongs to, the clip duration, all this kind of stuff. We send a ClipSubmittedEvent to another service, a thing called the Pollster. As soon as that gets the event, it starts pinging the video cloud every five seconds, I think; we tweaked it a little bit. But it's basically pinging the URL where we've asked them to publish this video, and waiting for a 200 OK, meaning it's gone through the pipeline, it's up and running, and it's ready to play.
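A rough sketch of that upload step using the stock .NET FTP client. The host address, file layout, manifest fields and event shape are all assumptions:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Xml.Linq;
using NServiceBus;

// Sketch: push the clip and an XML manifest up to the hosting partner over FTP,
// then tell the Pollster to start watching for the published clip.
public class Uploader
{
    public IBus Bus { get; set; }   // injected by NServiceBus

    public void Upload(Guid clipId, string clipPath, int memberId, TimeSpan duration)
    {
        var manifest = new XDocument(
            new XElement("clip",
                new XElement("member", memberId),
                new XElement("duration", duration)));

        UploadFile("ftp://hosting.example.com/" + clipId + ".mp4", File.ReadAllBytes(clipPath));
        UploadFile("ftp://hosting.example.com/" + clipId + ".xml",
                   Encoding.UTF8.GetBytes(manifest.ToString()));

        Bus.Send(new ClipSubmittedEvent { ClipId = clipId });
    }

    static void UploadFile(string uri, byte[] bytes)
    {
        var request = (FtpWebRequest)WebRequest.Create(uri);
        request.Method = WebRequestMethods.Ftp.UploadFile;
        using (var stream = request.GetRequestStream())
            stream.Write(bytes, 0, bytes.Length);
        request.GetResponse().Close();
    }
}

public class ClipSubmittedEvent : IMessage
{
    public Guid ClipId { get; set; }
}
```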
- 17:16 Dylan Beattie
- Sooner or later, the Pollster will either get an "Okay, the clip is good to go, it's ready to play" and mark it as published, or it will give up trying. We have a timeout of about 90 minutes on this. So if a video has been sat there for 90 minutes and it hasn't come through yet, at that point we give up and assume something has gone wrong. And the outcome of that whole system is... still processing.
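The Pollster is essentially repeated HTTP polling with a cut-off. A minimal sketch, assuming a simple loop rather than whatever scheduling the real service used:

```csharp
using System;
using System.Net;
using System.Threading;

// Sketch: ping the published clip URL every few seconds until it answers
// 200 OK (playable), or a 90-minute cut-off expires.
public class Pollster
{
    static readonly TimeSpan PollInterval = TimeSpan.FromSeconds(5);
    static readonly TimeSpan GiveUpAfter = TimeSpan.FromMinutes(90);

    public bool WaitUntilPlayable(string clipUrl)
    {
        var deadline = DateTime.UtcNow + GiveUpAfter;

        while (DateTime.UtcNow < deadline)
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(clipUrl);
                request.Method = "HEAD";
                using (var response = (HttpWebResponse)request.GetResponse())
                {
                    if (response.StatusCode == HttpStatusCode.OK)
                        return true;    // mark the clip as published
                }
            }
            catch (WebException)
            {
                // Not there yet (404 or similar): keep waiting.
            }

            Thread.Sleep(PollInterval);
        }

        return false;   // give up and assume something has gone wrong
    }
}
```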
- 18:03 Dylan Beattie
- So we'll pop back and check that in a second. Things that we wish we'd known. Now, I should emphasize before I begin this bit: this was 2012, and it was using NServiceBus 3. The core message-passing infrastructure, and the idea of publishing, subscribing, sending commands and so on, has been pretty rock solid with NServiceBus for as long as I've been using it. But this being our first serious foray into an event-driven system, we discovered a lot of things after we built it that we wished we had known a little bit more about before.
- 18:37 Dylan Beattie
- The biggest one, in a nutshell: this was NServiceBus running on top of MSMQ, and MSMQ is messy. We found it radically different to any other system that my team and I had ever worked with. It's very sensitive to configuration changes. It took us a little while to work out: is a message queue considered part of a development environment, or do different environments share the same queues? Are queues native to a specific box? How do you manage your queue addresses? What about your config transforms? We did find there's a commercial tool called QueueExplorer, which we found very useful when we were putting the system together.
- 19:19 Dylan Beattie
- This system, the one I'm showing you, the media system here, is deployed to AWS. We had a couple of headaches around that. For example, if you shut down a box and bring it up again, it'll often come up with a different IP address or a different host name, because Amazon assigns these things dynamically. So the solution we went with: we used Elastic IPs to make sure these boxes had consistent addresses, and therefore we could address the queues. Again, this is something where I think it's probably MSMQ and Amazon not playing nicely together, and knowing more than we did then about the way AWS and virtual private networking work, it's probably something we could have worked around. But again, it's something that we wish we'd known more about.
- 20:06 Dylan Beattie
- This one was interesting, particularly for our operations team: event-driven systems don't fail like other systems do. Something goes down; that doesn't necessarily mean there's immediately a problem. It doesn't mean you have a service interruption. It means something is unavailable for a second, and if you've built it right, all the important stuff will still be sat in a queue, waiting for it to come back up. The temporary network outages that cause a panic when you're talking about WCF or REST APIs: with this kind of stuff, you need to think about your monitoring a little bit differently. So instead of monitoring response times, monitor your queue lengths. Are queues starting to back up? Are things ending up in error queues that should've gone into another queue or been accepted?
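One way to watch queue depth on MSMQ is through its built-in performance counters. A hedged sketch; the counter category and instance naming below are what a standard MSMQ install is expected to expose, but they're worth verifying on your own boxes:

```csharp
using System;
using System.Diagnostics;

// Sketch: read the MSMQ "Messages in Queue" performance counter for a local
// private queue, so an alert can fire when a backlog starts to build up.
public static class QueueLengthMonitor
{
    public static long GetQueueLength(string queueName)
    {
        string instance = Environment.MachineName.ToLowerInvariant() + "\\private$\\" + queueName;

        using (var counter = new PerformanceCounter("MSMQ Queue", "Messages in Queue", instance))
        {
            return (long)counter.NextValue();
        }
    }
}

// Usage idea: alert on backlog rather than response time, e.g.
// if (QueueLengthMonitor.GetQueueLength("photoweb") > 1000) { /* raise an alert */ }
```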
- 20:47 Dylan Beattie
- And make sure the entire team understands the way the system works. Because for somebody whose experience is managing web servers and mail servers, when you say, "Oh, look, this is our new distributed media architecture," there are going to be a lot of things in there which probably don't work along patterns that are familiar to them. And, yeah, event-driven systems have latency. The fact you don't get an immediate response doesn't necessarily mean something is broken. So think about the way you're monitoring it.
- 21:19 Dylan Beattie
- As I said at the beginning, this was built in 2012 on NServiceBus 3. Actually, if you look at the Particular platform now, with a lot of the stuff that was shown off yesterday and this morning, ServicePulse would have taken away a lot of those monitoring headaches for us, and ServiceInsight would have given us a lot of the visibility. It would've made it a lot easier to share knowledge amongst the team about how the system hung together, and how the various message flows and pipelines and workflows were behaving.
- 21:51 Dylan Beattie
- So personally, it's been really interesting having the firsthand experience of trying to address these problems the hard way, and seeing how the platform has evolved to actually become a platform as opposed to a framework. It's not just something that will help you build systems; it's something that'll help you debug systems, monitor systems, troubleshoot them, run them in production, and get a good handle on when you might need to provision some more resources, or some more servers, because things are starting to slow down a little bit. And that's, like I said, I think an interesting perspective on the development that's happened in NServiceBus over the last couple of years.
- 22:30 Dylan Beattie
- Before we go to questions, there is, with no sound, the official NSBCon 2014 video selfie, powered by NServiceBus. We have one minute for questions. The answer to any question starting "Did you have problems with MSDTC" is yes; the second half of the question is generally not relevant. Yeah, we had a lot of headaches deploying MSMQ and MSDTC. We basically had to open so many ports, it started looking like we were really doing something wrong. Several of the alternatives that we would have looked into, if we'd had more time, would have been using an HTTP transport, or switching to something like RabbitMQ instead of MSMQ as the underlying transport layer, or deploying all of our AWS instances into a virtual private cloud so that there was no security between them, the boxes could talk to each other over a wide-open network, and then putting the security in place in front of that. But short answer: yeah, we had a couple of headaches with it. Anything else? Cool. Thank you very much.