The Mars Phoenix Lander Mission is a short-term mission to Mars to search for signs of water and a potential habitable site for an eventual manned mission to the Red Planet. This mission is a collaboration between NASA and the University of Arizona Lunar and Planetary Laboratory.
Sending hundreds of pounds of equipment millions of miles through space to land and operate independently from direct control presents several interesting software development challenges. O'Reilly News recently discussed the project and its technology with NASA's Peter Gluck.
Peter, can you give me a brief background for yourself?
I'm the project software engineer for the Phoenix Lander Mission. I've been with the project for four and a half years, responsible for the overall implementation of all of the software on the project, coordination of flight and ground software, successful verification and validation, all of that good stuff. So let's see. I have a background in aerospace engineering. I've got a Master's in Aerospace Engineering. Worked previously on the Mars Reconnaissance Orbiter, Spitzer Space Telescope, Deep Space One, Mars Observer.
It might be easier just to say what you haven't worked on!
Well, I don't know how far back you want to go.
You're the project software engineer. I noticed that was singular. Is there just one project software engineer? Like sort of the managing engineer for the project?
Yeah. Project Software Systems Engineer is the title. And our software was developed — the flight system software was developed in three different locations. Lockheed Martin developed the spacecraft software, and then we had payload software developed by both the University of Arizona and the Jet Propulsion Laboratory.
By payload software, do you mean software that runs the experiments and reports things back?
I assume then that all of the embedded software is running on a RAD 3000 board or something like that, a 6000 board?
RAD 6000. Yes.
That's the radiation-hardened and tested board that Lockheed Martin runs?
That's correct. Now they have — the software that actually operates the spacecraft systems was written by Lockheed Martin. And then they also host the software modules that run the payloads that run the instruments.
Is the payload software also running on that embedded board?
Yes, it is for the most part. There are a couple of payloads that have their own internal processors, but every payload has at least some sort of interface software running in the embedded.
That's not a really beefy embedded board actually. It's what, thirty-three megahertz?
Yeah. That's — yeah.
About 128 megabytes of RAM?
I imagine that produces some interesting challenges, getting all of that software to run together on that board while also having it land on the planet successfully.
Yes. It certainly does. I mean it's a venerable board. We've used it ever since the Mars Pathfinder Mission what — about 12 years ago, 14 years ago?
This is the last of the missions that are slated to fly this board. And the only reason we flew that one was because we inherited an already partially assembled Lander system from the 2001 Lander project that was cancelled.
That's why this is called the Phoenix, because you used parts from other missions?
When you talk about putting all of this software together, do you go to Lockheed Martin and say, Here's what we're doing. We're landing this on one of the poles of Mars. You know what it takes to get 700 pounds of equipment there. Go ahead and do it. Tell us what we have to work with. Or is it more of a back and forth process?
It's a joint process. What happens is in this particular case, I believe there was an announcement of opportunity from NASA for Mars Scout missions. There was a Mars Scout program that was initiated, and they requested academia and industry to submit proposals. And one of the proposals that was submitted was a joint effort between the University of Arizona, the Jet Propulsion Laboratory, and Lockheed Martin to refit and fly the 2001 Lander to a northern polar region. That dovetailed nicely with the results from the gamma ray spectrometer on the Mars Odyssey Spacecraft that recently, in 2002, discovered the presence of hydrogen in the polar regions of Mars. And so the intent was to go and verify that the hydrogen was indeed in the form of water ice located at the poles.
Should I assume that the proposal, at least from the University of Arizona, included proposals for various types of instruments and experiments and missions related to that?
Right. The principal investigator, Peter Smith in this case, had a coalition of scientists that he worked with, and they established a baseline payload for the mission which they put into the proposal. And then NASA evaluated all of the proposals. And Phoenix was selected roughly in the mid-2003 timeframe.
That's pretty quick going from proposal to actually landing on a planet.
Yeah. We went — well, let's see. We got started as a project in October of 2003, maybe August, somewhere in that timeframe. And then we launched less than four years later from the Kennedy Space Center, Cape Canaveral Air Force Station.
Impressive to me, but I guess a lot of these people have done this before. I mean you certainly have.
That's a pretty typical timeline for these things. If you're going less than 30 months, that's really kind of pushing it. Typically, three to four years is about an average lifecycle for these things.
In terms of software development, can you talk a little bit about that? I mean I've read various things. I've done some software development on my own. Pete McBreen, for example, had a great book a couple of years ago called Software Craftsmanship, where he said that about the only place where you can sit down and design all of your software up front is where you actually have the hardware; you know your exact constraints, and you know you're not going to be able to update the software once you deliver it. What's the process look like there for getting software to get 700+ pounds of metal and equipment to another planet?
Generally what we do with these proposals, particularly with these kinds of missions, is we try to rely on — in order to cut costs, we try to rely on software that's been previously established and proven to work. So in this case, this is part of a line of spacecraft that Lockheed Martin has produced, and the software has heritage literally going back to the Mars Pathfinder mission.
We've had about six successful missions to Mars from NASA?
Recently, we have Mars Global Surveyor, Mars Odyssey, Mars Reconnaissance Orbiter, the two MER missions and this one. So that would be six, yeah.
A pretty good pedigree of at least the control software in that case.
Yeah. Now the MER missions, the rovers that JPL landed were built by JPL and so their heritage is — they're probably more like distant cousins. But the Phoenix spacecraft is a twin to the Mars Odyssey spacecraft. And so the flight software on both of those is very similar with the exception of the capability required to land on the planet and then execute the surface missions.
And you probably don't have the control for roving because the Phoenix doesn't rove.
Right. Exactly. We don't have any motion control or anything like that on the surface.
The control software is pretty much a solved problem at this point. What we're looking at for the interesting parts are the applications, the experiments and all of that other control software?
Yeah. I mean for Phoenix, the most interesting parts were the entry, descent and landing, which was something that Lockheed had not successfully done in the past.
Does that keep you up at night sometimes?
It did. [Laughs]
Here's my $325 million baby. Will it land?
Yeah. That was the big day. The really big day was the entry day on May 25th. That was the thing that a lot of people worried about, that we spent a lot of resources on making sure that the system was going to work properly. The radar is a great example. We had a commercial radar which was an F-16 radar. And it was great at identifying aerial targets but not so good at figuring out whether you were safely going to land or not. And so the team — the Entry Descent and Landing team spent an enormous amount of time and resources in making sure that that radar was going to work properly, characterizing it, identifying the weak areas in the performance of the radar and then building the software so that we could work around those weaknesses.
In this three to four year process, what's the timeline look like? I mean you go from — it still seems pretty fast to me to go from proposal to actually planet fall. But I mean you get the proposal; you send it back saying, Oh hey, let's go look for liquid or solid water on this planet. How does that develop? Especially with regard to the people at the University of Arizona?
So you get a go-ahead and then the funds start; you develop your requirements and have those reviewed. And we have a lifecycle that we use at JPL where we go through a requirements definition and then we go into our preliminary design phase and then we go into a detailed design phase. And then at the end of detailed design, you actually have most of the components ready to be assembled and you start to assemble and integrate and test the vehicle. And so roughly speaking, you spend something like nine months in your requirements definition and then another year in your design phase, maybe a year and a half. And then you spend a year to a year and a half in your test phase. And then you launch it.
It sounds like a pretty standard waterfall cycle. Is that correct?
Yeah. Yeah, it is. It's a pretty standard waterfall cycle. Now the software typically iterates a few times. What you do is you build it up with the capabilities to support the rest of the project. So if your power system is arriving and you need software to be able to run it, then you build the power software first. And typically, you build up the spacecraft bus components first. And then when the payloads arrive, you build up the payload software so that that comes in at the right time. On this particular mission, the payload actually got quite a bit of a head start on the spacecraft. There were some contractual issues. There were personnel issues. One of the problems that we have is phasing our development with other missions that are going on. So for example, on Phoenix, one of the challenges was that while we were trying to get our development really going in mid-2005, Mars Reconnaissance Orbiter was approaching the launch pad. And they were having trouble with some of the new hardware they had. It's always a challenge when you have new hardware.
And so they were the first of our missions to fly the RAD 750, for example, which is the next-generation radiation-hardened PowerPC processor. And they had changed architectures from a VME architecture to a PCI-based architecture. And so they had a number of challenges that they faced. And it was the same core group of people that were going to work on Phoenix that had been working on Mars Reconnaissance Orbiter, and we had difficulty getting the people transferred off of MRO and onto Phoenix just because of the timing and the fact that MRO really had these problems. And, of course, it took priority because it was closer to launch. So —
And these people are having to do an architecture switch as well after they get transitioned.
Yeah. Right. Then they had to go back and say, Okay. Now I'm no longer working on PCI. I'm back on the old VME stuff that I used to know.
And you have to fight with the smaller processor and less RAM and everything.
Right. Exactly. So —
Like a challenge.
Yeah. So it is a challenge to phase all of these things. And sometimes things don't go the way you hope they will. But you work with the problems, and we carry reserves when we go into this to help fund the problems when they come up.
What does the verification and validation step look like for the software?
Yeah. From a software perspective, one of the things that's really helpful is having what we call test beds or test platforms where you can run the software in a flight-like or deployment-like environment, and you can verify that the requirements are met. So the first step is to actually run the verification testing. This is after the developers have developed and unit tested it and said, Okay, I think my module's ready for integration. Once it's all integrated, then we run the verification tests on it to verify the functionality, that it meets the requirements.
Is this a manual stage or automated or some combination?
It is automated to a large extent. The capability at Lockheed permits automation and scripting of the tests. And they can be rerun. It typically — the whole battery of tests was — I think there were about 85 total tests that were run on the software at an integration level.
Per each payload or for —
No, total. Total for the — well, okay. Total for the flight system. And then the payloads had their own — each payload had roughly — well, some only had one, and some had three or four. It depends on the complexity of the payload.
Who creates these tests?
The tests are created by the — generally by the developer of the software in conjunction with the instrument or subsystem engineer.
Okay. So they develop their software for the payload. They do their standard unit testing depending on their own development practices.
They say, Okay. We believe this works. They hand it to you, and they hand you these — we'll call them customer tests for lack of a better term.
Then you run these all together on the Lockheed simulator and whatever testing systems they have.
Yeah. We had a team that ran the tests, and then they had to be reviewed by the author of the test as well as by the system engineer responsible for that instrument or that subsystem. And then in some cases, I reviewed some of them as well.
I mentioned the RAD 6000 board earlier.
For our listeners, that's a radiation-hardened PowerPC board. Did you do any specific environmental or radiation testing, hostile environment testing on the software as well? Or is that something that Lockheed Martin takes care of in their verification?
No, we don't do that. The RAD 6000 has built-in error detection and correction. So the hardware does RAM scrubbing. There is a RAM scrubbing that occurs on a continuous basis. And beyond that, we have internal fault protection that monitors the health and safety of the software. And if a software task, for example, fails to respond to a ping, we have pings in the system, then the fault protection task will declare that a fault has occurred and will safe the spacecraft. And what that means, by "safing", we mean that the spacecraft will enter into a power and communications safe mode where it will just sit and wait for the ground to respond. It'll basically phone home and say, I've got a problem; somebody tell me what to do.
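The ping-and-safe pattern Gluck describes can be sketched in C, the language the flight software is written in. This is only a hypothetical illustration, not Phoenix flight code; the task table, the missed-ping threshold, and all of the function names here are invented:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_TASKS         4
#define MISSED_PING_LIMIT 3   /* invented threshold, not the flight value */

/* One health record per monitored software task. */
typedef struct {
    int  missed_pings;  /* consecutive ping cycles with no response */
    bool responded;     /* set by the task when it answers a ping */
} task_health_t;

static task_health_t tasks[NUM_TASKS];
static bool safe_mode = false;

/* A monitored task calls this each cycle to answer the ping. */
void task_ack_ping(int id) {
    tasks[id].responded = true;
}

bool in_safe_mode(void) {
    return safe_mode;
}

/* Fault-protection task, run once per ping cycle: any task that misses
   too many pings in a row is declared hung and the spacecraft is safed. */
void fault_protection_cycle(void) {
    for (int i = 0; i < NUM_TASKS; i++) {
        if (tasks[i].responded) {
            tasks[i].missed_pings = 0;
        } else if (++tasks[i].missed_pings >= MISSED_PING_LIMIT) {
            safe_mode = true;  /* a real system would also shed loads and
                                  wait for ground commands here */
        }
        tasks[i].responded = false;  /* re-arm for the next cycle */
    }
}
```

The point of the pattern is that a hung task is detected by its silence rather than by anything it does, which is why it works even when the task is stuck in a loop.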
Yeah. Now, there's hardware built into the avionic system that if the software were to completely lock up like your PC might on occasion.
No, I run Linux. It doesn't do that.
[Laughs] Never ever?
Okay. Well, sometimes [Inaudible] or something.
Right. So if it were to completely lock up, the hardware has to be stroked every 64 seconds. There's a watchdog timer. And so if that 64 second period expires, then the hardware resets and the software is rebooted, and hopefully that clears whatever error occurred. Now in the event that that doesn't work, we have a whole second set of avionics onboard. So the hardware will try to boot to the same side, and if the same side doesn't come up and start stroking the watchdog timer, then it will swap to the other side and boot the first side.
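The watchdog behavior described above, including the fallback to the redundant avionics side, can be modeled roughly like this. The 64-second period comes from the interview; everything else, including the two-strikes swap logic and the names, is a simplified guess at how such hardware behaves:

```c
#include <assert.h>

#define WATCHDOG_PERIOD_S 64   /* from the interview: 64-second expiry */

typedef enum { SIDE_A, SIDE_B } avionics_side_t;

static avionics_side_t active_side = SIDE_A;
static int seconds_since_stroke = 0;
static int expiries_on_side = 0;

/* Healthy flight software strokes the watchdog well inside 64 seconds. */
void watchdog_stroke(void) {
    seconds_since_stroke = 0;
    expiries_on_side = 0;
}

avionics_side_t watchdog_active_side(void) {
    return active_side;
}

/* Simplified model of the hardware, advanced once per second: on the
   first expiry, reboot the same side; if that side still fails to come
   back up and stroke the timer, swap to the redundant avionics side. */
void watchdog_tick(void) {
    if (++seconds_since_stroke >= WATCHDOG_PERIOD_S) {
        seconds_since_stroke = 0;
        if (++expiries_on_side >= 2) {
            active_side = (active_side == SIDE_A) ? SIDE_B : SIDE_A;
            expiries_on_side = 0;
        }
        /* (a reboot of the software on active_side would happen here) */
    }
}
```

Running the model without any strokes shows the escalation: the first expiry reboots the same side, the second swaps sides.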
How often are you in contact? How often is ground control in contact with the spacecraft?
In cruise, we had contact about every three days for eight hours. During EDL, we had continuous contact for about, I don't know, 24 to 48 hours. Maybe it was longer than that. I'm not really sure. On the surface, we get at least two passes, two full communications passes a day. We actually get passes about every two hours during nighttime operations. The spacecraft has to wake up every couple of hours and verify its power state, and sometimes there's a communications pass associated with that. Generally in terms of command ability and getting science data back, we do two or three passes a day.
Mars and Earth are a ways away in space terms, I think that's the technical term. If, for example, the software had to shutdown and go into command mode, rescue mode, whatever you want to call it —
Safe mode. Right.
What's your availability of getting that information from the spacecraft and communicating it back? Is that a couple of windows? Is that one window?
Yeah. So there's a whole system of how the spacecraft responds with the communication passes once it goes into safe mode. And the spacecraft knows where the orbiter passes are going to occur. And there's a balance between preserving power and taking advantage of communications opportunities. So the system engineering team has worked out ahead of time what opportunities they would like to use. And when the timer expires for that opportunity, then the spacecraft will wake up. And it will try to communicate with an orbiter. And it'll just keep doing that as long as power holds out and the table doesn't expire.
How soon can you get new information to it to tell it how to reset?
Well, let's see. We've got to account for one-way light time. And there's a little bit of margin in terms of the relay because we don't have a direct-to-Earth capability. We have to relay with an orbiter. But we can turn commands around in a few hours.
That's pretty good. Are there real-time concerns with regard to the payload software? I mean you have your 64 second heartbeat there.
Are all of the process there running — having to do real-time concerns? Do they have hard guarantees about how often they're going to respond about latencies and such?
Oh absolutely. For example, the robotic arm software has to run several times a second in order to ensure that the arm is moving properly, that it's not going to collide with something on the deck or the Martian surface or landing leg or whatever. All of them have some real-time aspect to them.
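The arm's collision check can be imagined as something like the following sketch, run at the several-hertz rate Gluck mentions. The keep-out geometry, the numbers, and the function names are entirely invented for illustration; the real arm software models the deck, legs, and terrain far more carefully:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { double x, y, z; } arm_pos_t;

/* Axis-aligned keep-out box: a crude stand-in for real deck geometry. */
typedef struct { double min_x, max_x, min_y, max_y, min_z, max_z; } keepout_t;

#define NUM_ZONES 2
static const keepout_t zones[NUM_ZONES] = {
    { -1.0,  1.0, -1.0, 1.0, -0.2, 0.0 },  /* lander deck (invented numbers) */
    { -2.0, -1.5, -0.5, 0.5, -1.0, 0.5 },  /* a landing leg (invented numbers) */
};

static bool in_zone(const keepout_t *z, arm_pos_t p) {
    return p.x >= z->min_x && p.x <= z->max_x &&
           p.y >= z->min_y && p.y <= z->max_y &&
           p.z >= z->min_z && p.z <= z->max_z;
}

/* Called several times a second: only step toward the target if the
   next commanded position stays clear of every keep-out zone. */
bool arm_step_is_safe(arm_pos_t next) {
    for (int i = 0; i < NUM_ZONES; i++) {
        if (in_zone(&zones[i], next))
            return false;  /* would collide: halt the arm and raise a fault */
    }
    return true;
}
```

Checking the *next* position before commanding the motion, rather than reacting after contact, is what makes the several-times-a-second rate a hard real-time requirement.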
It sounds like an interesting, almost a packing problem, trying to figure out exactly all of the payloads you want to run and how their time concerns, how much time they want, how long it's going to take them to respond to actually work together.
Yeah. The operation of the vehicle is very complicated. And there's a lot of preparation that went into it. And there is a lot of activity that happens every day in order to determine what science activities to perform and when to perform them. There's literally scores of people in Tucson right now — well, they don't start their shift until 1:15 today, but that are going to be there today assessing the telemetry from the spacecraft, determining what the results were from yesterday, what they want to accomplish today and then preparing the uplink package that's going to go to the spacecraft later this evening.
You can change almost every day the schedule for the next day?
Yeah. Absolutely. As long as we have good communication and the spacecraft's in a good healthy state, they determine on a day-to-day basis. We call that the tactical timeline.
What the spacecraft is going to do for that day. Now in the event that something goes wrong and we don't have a good communications pass — for example, early in the mission, we tried to use the Mars Reconnaissance Orbiter's Electra radio, and that failed on us one day and refused to communicate. The spacecraft then has what we call a run-out sequence which is the background sequence that's going to execute in the event that no other commands are received. And so it'll take some — that's typically some image taking and some weather data and things like that.
Do you also have a strategic team or is that the whole mission is a strategic plan and you're the tactical team?
There is a strategic team. So there's folks that are working the tactical timeline, and there are folks that are working the strategic plan for what we are going to be doing a week, 10 days, 15 days out. And that changes on a day-to-day basis depending on the new discoveries. When they find ice in a trench, then they want to go and work on that right away. If they've got other things they want to examine down the road, those get planned for later times. And there's also engineering work that happens. There are software updates; we can update the software in flight. And, in fact, we have made some updates to it.
Can you tell me a little more about that?
One of the things about the software, one of the unique aspects about working software on a spacecraft is it's the only part of the system that you can change after launch. So typically, we discover things as we are doing our testing that we may be able to live with but would be better if we could fix them or work around them. And in one case on this particular mission, there's a problem with one of our interface boards where it's susceptible to — due to an internal design conflict, it's susceptible to resetting occasionally. And so our software was not initially designed to be robust to that kind of event. To prevent any damage to the instruments, we went ahead and we improved the software so that the instruments could not be harmed by a reset or some bad bits coming across the serial line. The software updates went up on the surface. So we have to uplink the files, and then we have to reset the spacecraft software. We have to reboot the software, and then verify that it's all working correctly. And then we could proceed with our science operations.
How big are those updates typically? I'm just trying to get a figure of how much communication time you're looking at.
Yeah. It varies. Some of the spacecraft modules — the payload modules are quite small on the order of say tens of kilobytes and others are much, much larger, hundreds of kilobytes. So it just depends on which one we're updating and what the capabilities are that we need. We do have the capability to patch specific addresses. And so we can send up very, very small patches if we need to. But patching is more complicated, and we prefer to just send up an entire object if we can.
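The address-specific patching Gluck mentions might look conceptually like the sketch below: a bounds-checked write of uplinked bytes into a loaded module image. The actual command format and structures are not public, so every name and field here is an assumption:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical uplinked patch: write `len` bytes at a given offset
   into a loaded module image. */
typedef struct {
    size_t offset;          /* byte offset within the module image */
    size_t len;             /* number of bytes to patch */
    const uint8_t *bytes;   /* replacement bytes */
} patch_t;

/* Apply a patch, refusing anything that runs past the image: a bad
   patch must fail cleanly rather than corrupt adjacent memory. */
int apply_patch(uint8_t *image, size_t image_len, const patch_t *p) {
    if (p->offset > image_len || p->len > image_len - p->offset)
        return -1;          /* reject out-of-bounds writes */
    memcpy(image + p->offset, p->bytes, p->len);
    return 0;
}
```

The bounds check is the whole story: sending a whole replacement object, as Peter says the team prefers, trades uplink volume for the certainty that no such offset arithmetic can go wrong.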
You don't want to break your spacecraft.
Yeah. We're very careful about that. We test everything on the ground. We have a full avionic system in Tucson. We call it the payload test laboratory. And we've got a full Lander deck with all of the instruments represented by an engineering model. And so we can test every software update. We can test every command sequence on the ground. Of course, we have to compensate some because we're working in Earth gravity. For example, so the arm actually has a tractioning system. There's a series of pulleys and lines that are holding the arm up and providing some additional force on the arm, upward force to counteract the stronger gravity here on the Earth. But we've got this whole system there that we can test everything on. And we make sure we run everything through that system before we send it up to the spacecraft.
If you have a problem, you try it on this one and you know it's not going to work.
Right. Yeah. So hopefully we operate the way we tested, and the test results represent the actual flight environment, and we'll be able to detect any errors on the ground before they make it to the spacecraft.
Makes sense. I mean if you can afford two Phoenixes, why not?
[Laughs] Well, we can't afford not to have it. So that's the thing. You don't want to be experimenting on your flight vehicle that's 150 million miles away and you only have 90 days to work with it.
I'm looking at some information from the Ames Research Center that looks like an announcement about a software tool called Ensemble. Are you familiar with that?
I am not.
Okay. It says it's based on Eclipse software, the IDE for Java, part of the tool for the Phoenix Science Initiative used on the Mars Phoenix Lander.
We do have a tool called PSI. I didn't realize it stood for Phoenix Science Initiative. I thought it stood for —
Phoenix Science Interface I think?
Phoenix Science Integrator I thought is what it stood for. But yeah and sometimes people define things a little bit differently. I saw a definition for CPU today that said, Computer processing units. So but yes, we do have a tool called PSI that is used to integrate the plans for the various science teams. Right. Each instrument has its own team that is associated with it, and so there's a robotic arm camera team. And there's an SSI camera team. And there's a meteorology team. And there's a TEGA team. And all of these different teams have competing interests, right? I mean the SSI camera team would be happy to take pictures of the horizon all day long, and that's all they care about.
I mean I shouldn't say that's all they care about, but that's their main interest.
And those are pretty pictures too.
And they are pretty pictures. And they take pictures of the sky. They coordinate with the LIDAR, and they take pictures of the sky when the LIDAR is firing. There's a little tell-tale which is a dangling bob that's mounted on top of the meteorology mast, and they take pictures of that. And they can measure the wind speed and direction from the deflection of the tell-tale. So there's all that going on. And then you have the critical science instruments which are the analyzers, the MECA and the TEGA which are part of our main mission requirements. And so they need to get their runs in. TEGA runs can take hours. That's the thermal analyzer. Right?
I don't know how familiar you are with these acronyms I'm throwing out. Those take several hours. And so all of this has to be integrated into a coherent plan. And so the PSI tool supports that integration of these various competing interests into a plan that A, won't exceed spacecraft resources; B, won't exceed mission downlink resources and C, will provide a good compromise of science for the science team.
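The three constraints Peter lists — spacecraft resources, downlink, and science balance — amount to a budget check over the day's plan. A toy version of the first two checks, with an invented cost model and invented budget numbers (the real PSI models are far richer), could look like:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-activity cost model. */
typedef struct {
    const char *name;
    double energy_wh;      /* spacecraft energy the activity consumes */
    double downlink_mbit;  /* data volume the activity produces */
} activity_t;

/* Check that a day's plan fits within the spacecraft energy budget
   and the mission downlink budget. */
bool plan_fits(const activity_t *plan, int n,
               double energy_budget_wh, double downlink_budget_mbit) {
    double energy = 0.0, downlink = 0.0;
    for (int i = 0; i < n; i++) {
        energy   += plan[i].energy_wh;
        downlink += plan[i].downlink_mbit;
    }
    return energy <= energy_budget_wh && downlink <= downlink_budget_mbit;
}
```

The third constraint, a fair compromise between science teams, is the part that needs human adjudication; the tool's job is to make the first two checks automatic so the argument is only ever about priorities.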
I guess it's the job of the tactical team to coordinate all of these various requests. They say, Oh, here's something really interesting. We want to look at this rock tomorrow all day.
Right. Exactly right.
How much input do the people at the University of Arizona have into that process?
Well, the University of Arizona is well-represented. The TEGA and the SSI and the RAC cameras were all developed at the University of Arizona. And so their teams are largely comprised of folks from U of A. In the case of the RAC, there's also some folks from the Max Planck Institute for Aeronomy in Germany. There are other U of A folk that are on some of the other teams. But the scientists are coming from all over. There are Canadian scientists that are on the MET team there. There are scientists from JPL. In fact, I couldn't even tell you where all of the scientists are from. There's a large contingent of scientists there. Some of whom I've never been introduced to. [Laughs]
Scheduling all of those disparate interests might be a really contentious job.
Right. Which is why we have PSI. So PSI helps to do the modeling and provide the assessment necessary for the mission managers to adjudicate who's going to get to go and who isn't.
Say, "Don't blame me, it's just the software."
I've said that before.
Well, it's a delicate balancing act between what's in the best interest of the mission and what each science team would really like to accomplish.
The only reason I brought that up is because one of my editors said, Oh look, they have Java on this thing.
Oh, Java. Well, we have Java in the ground system not onboard the spacecraft.
Right. That's what it's starting to sound like.
That's right. Yeah. The spacecraft software is entirely in C.
C? Really? That surprises me a little bit.
Yes. It's entirely in C.
I thought Lockheed Martin was a big Ada shop for this sort of thing.
Ada is used largely in military applications, but JPL at any rate has moved away from Ada. Cassini, I believe, would be the last JPL mission that used Ada. And that was largely due to the success of the Mars Pathfinder in the mid-nineties. And as I said, these missions are to a large extent all derived from Mars Pathfinder.
After that successful mission, you say, Hey, we could do it in C now. That's not as scary as everybody thought?
Yeah. Right. And we've been running VxWorks as our real-time operating system. I believe they all run VxWorks. I couldn't speak for some of — let's see. We've been branching out in contractors a little bit. I believe — you know what? I'm not sure what Ball Aerospace is using in their spacecraft. And I am not sure what Orbital Sciences is using in their spacecraft. But the Lockheed Martin line does use VxWorks as the operating system. In addition to the six Mars missions we mentioned, we've done another half a dozen Lockheed Martin missions over the last 12-15 years.
You have a high degree of confidence in the whole process there now?
Actually, yes, I do. I think that they have a very good set of processes there. They have a very good team of people. Our other contractors also have good people, and they're coming up to speed on what they need to do in order to have successful missions. Deep Impact was a very successful mission that Ball Aerospace produced for us. Dawn is up there thrusting away. That's an Orbital Sciences mission. So we work with our contractors to engage them and help them understand what it is that we need. I mean JPL has 50 years of experience in planetary space exploration.
And we don't expect someone who's relatively new to the field to understand everything we do. We have had a longer working relationship with Lockheed Martin, so it's more mature.
But this is still rocket science.
Oh, absolutely. There's nothing routine or commonplace about it. Every mission is unique and presents unique challenges and unique risks that we do our best to try and minimize. But EDL is by no means a sure thing.
I think we calculated our probability of success at somewhere in the ninety-percent range, but —
That's pretty good for putting something on a rock flying through space hundreds of millions of miles away.
Well, we think it is. And we really had a perfect EDL. I mean it just went straight down the pipe in terms of being right where we had hoped it would be. But we were robust to contingencies, and except in the very worst of circumstances, we still could've made it down safely and hopefully had a decent mission out of it. As it is, we landed on a huge sheet of ice. And not only was EDL perfect, but the landing location is absolutely perfect for this mission.
Icy, that's what we wanted.
That's what we wanted.
Sort of on a different topic, I have a quote here. One of our editors talked to Frank Hecker from the Mozilla Foundation the other day.
In that talk, he suggested that all software developed by the Federal Government should be released to the public domain or a very, very liberal open-source license. That's not even a copyleft license. Does the American public have any access to the source code currently on the Phoenix? Are there plans to make some of the source code available?
Well, no. There are no plans to make that available. And one of the issues that we have is that our spacecraft are designated as subject to the International Traffic in Arms Regulations. So even —
Crypto regulations in exporting and such?
Yeah. Yeah. I mean even though these are not military spacecraft, the technology used in them is space technology. And so the State Department does not allow us to release anything that we've done in terms of technical details to foreign scrutiny. Now, in fact as I said, we have a team of Canadians. The Canadians delivered our meteorology instruments, and we had to be very careful about our relationship with them and how much we could disclose to them.
I can see that in applying control software, but how about the payload software?
Even the payload software — in this particular case, remember that the payload software operates within the confines of the RAD 6000 that contains the spacecraft software. And although the newer versions of real-time operating systems allow you to compartmentalize better, the older ones just have a global namespace. So there really wasn't any way to allow them to provide software for the MET instruments. So we had to define an interface and build the software at JPL, and then do our integration testing. And we worked closely with the Canadians in terms of the integration testing and making sure that the software was going to do what they needed it to do.
But we could not actually release the source code to them.
Am I right in assuming that there's very little process separation in the older RAD 6000 boards?
There is no process separation. I mean basically what we —
One bad pointer in one module and —
— you wait for the next update window.
Wow. I could write software like that.
[Laughs] Well —
I won't give you a 90 percent certainty though.
We have strict coding guidelines that we use. We don't allow dynamic memory allocation, for example.
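The no-dynamic-allocation rule usually means buffers come from fixed, compile-time-sized pools instead of malloc, so worst-case memory use is known before launch. This is a generic illustration of the style, not Phoenix's actual guidelines; the pool sizes and function name are invented:

```c
#include <assert.h>
#include <stddef.h>

/* Flight-style static allocation: all storage is sized at compile
   time, handed out from a fixed pool, and never freed. */
#define POOL_BLOCKS 16
#define BLOCK_SIZE  256

static unsigned char pool[POOL_BLOCKS][BLOCK_SIZE];
static int next_free = 0;

/* Returns a fixed-size block, or NULL when the pool is exhausted.
   Exhaustion is a design error meant to be caught in ground testing,
   never discovered for the first time in flight. */
void *static_alloc(void) {
    if (next_free >= POOL_BLOCKS)
        return NULL;
    return pool[next_free++];
}
```

Because the pool is a static array, there is no fragmentation, no allocation failure path that depends on run history, and the memory map is identical on every boot.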
Transcript provided by Wendy Smith
Image credit: NASA/JPL-Caltech/University of Arizona