How to use the watchdog timer in an RTOS?

user946230 · Nov 4, 2012 · Viewed 15.5k times

Assume I have a cooperative scheduler in an embedded environment. I have many processes running. I want to utilize the watchdog timer so that I can detect when a process has stopped behaving for any reason and reset the processor.

In simpler applications with no RTOS I would always touch the watchdog from the main loop and this was always adequate. However, here, there are many processes that could potentially hang. What is a clean method to touch the watchdog timer periodically while ensuring that each process is in good health?

I was thinking that I could provide a callback function to each process so that it could let an overseer function know it is still alive. The callback would pass the task's unique ID as a parameter so the overseer could determine who was calling back.

Answer

Dan · Nov 4, 2012

One common approach is to delegate the watchdog kicking to a specific task (often either the highest-priority or the lowest-priority task; each choice has its own tradeoffs and motivations), and then have all other tasks "check in" with this task.

This way:

  • if an interrupt is hung (100% CPU), the kicker task won't run, you reset

  • if the kicker task is hung, you reset

  • if another task is hung, the kicker task sees no check-in, the kicker task doesn't kick the WDG, you reset

Now there are of course implementation details to consider. Some people have each task set its own dedicated bit (atomically) in a global variable; the kicker task checks this group of bit flags at a specific rate, and clears them when everyone has checked in (along with kicking the WDG, of course). I avoid globals like the plague, so I don't use this approach. RTOS event flags provide a somewhat similar mechanism that is more elegant.
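For illustration, the bit-flag variant can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes C11 atomics are available on the target, and every name in it (`task_check_in`, `kicker_poll`, `wdg_kick`, `NUM_TASKS`) is hypothetical. The `wdg_kick` function is a stand-in that only counts kicks; on real hardware it would write the watchdog reload register.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_TASKS      3u                       /* hypothetical task count */
#define ALL_TASKS_MASK ((1u << NUM_TASKS) - 1u) /* one bit per task */

static atomic_uint checkin_flags;   /* bit N set => task N has checked in */
static unsigned    wdg_kick_count;  /* stand-in for the hardware watchdog */

/* Hypothetical hardware hook; a real one would reload the WDG counter. */
static void wdg_kick(void)
{
    wdg_kick_count++;
}

/* Called by each task from its own loop to report that it is alive. */
void task_check_in(unsigned task_id)
{
    assert(task_id < NUM_TASKS);
    atomic_fetch_or(&checkin_flags, 1u << task_id);
}

/* Called by the kicker task at a fixed rate.
 * Kicks the WDG only if every task has checked in since the last kick.
 * Returns true if the WDG was kicked. */
bool kicker_poll(void)
{
    unsigned expected = ALL_TASKS_MASK;

    /* Clear the flags only if all bits are set, in one atomic step, so a
     * check-in that arrives in between is never silently discarded. */
    if (!atomic_compare_exchange_strong(&checkin_flags, &expected, 0u)) {
        return false;   /* someone is missing: let the WDG run down */
    }
    wdg_kick();
    return true;
}
```

If any one task stops calling `task_check_in`, `kicker_poll` keeps returning false and the watchdog eventually expires. An RTOS event-flag group could replace the atomic variable with the same overall structure.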

I typically design my embedded systems as event-driven systems. In this case, each task blocks at one specific place - on a message queue. All tasks (and ISRs) communicate with each other by sending events / messages. This way, you don't have to worry about a task not checking in because it's blocked on a semaphore "way down there" (if that doesn't make sense, sorry, without writing a lot more I can't explain it better).

There is also the question of whether tasks check in "autonomously" or reply to a request from the kicker task. Autonomous - for example, once a second, each task receives an event in its queue: "tell the kicker task you're still alive". Request-reply - once a second (or whatever), the kicker task tells everybody (via queues) "time to check in", and eventually every task runs its queue, gets the request and replies. Considerations of task priorities, queueing theory, etc. apply.
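The request-reply variant might look something like the sketch below. It is only an illustration under stated assumptions: the tiny FIFO stands in for a real RTOS message queue, it is single-threaded, and all names (`event_t`, `kicker_request_checkin`, `task_run_queue`, `kicker_collect`) are made up for this example. A round number ties each reply to the request it answers, so a late reply from an earlier round doesn't count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_TASKS 3u   /* hypothetical task count */
#define QUEUE_LEN 8u

typedef enum { EVT_CHECKIN_REQUEST, EVT_CHECKIN_REPLY } evt_type_t;

typedef struct {
    evt_type_t type;
    uint8_t    task_id;  /* sender of a reply; unused in requests */
    uint8_t    round;    /* ties a reply to the request it answers */
} event_t;

/* Minimal FIFO standing in for an RTOS message queue. */
typedef struct {
    event_t  buf[QUEUE_LEN];
    unsigned head, count;
} queue_t;

static bool queue_put(queue_t *q, event_t e)
{
    if (q->count == QUEUE_LEN) return false;   /* full: drop the event */
    q->buf[(q->head + q->count) % QUEUE_LEN] = e;
    q->count++;
    return true;
}

static bool queue_get(queue_t *q, event_t *out)
{
    if (q->count == 0) return false;
    *out = q->buf[q->head];
    q->head = (q->head + 1u) % QUEUE_LEN;
    q->count--;
    return true;
}

static queue_t  task_queue[NUM_TASKS];
static queue_t  kicker_queue;
static uint8_t  current_round;
static unsigned replies_seen;    /* bitmask of tasks that answered this round */
static unsigned wdg_kick_count;  /* stand-in for the hardware watchdog */

/* Kicker: start a new round by posting a request to every task. */
void kicker_request_checkin(void)
{
    current_round++;
    replies_seen = 0;
    for (unsigned i = 0; i < NUM_TASKS; i++) {
        event_t req = { EVT_CHECKIN_REQUEST, 0, current_round };
        (void)queue_put(&task_queue[i], req);
    }
}

/* Task: drain its queue, replying to any check-in request found there. */
void task_run_queue(uint8_t task_id)
{
    event_t e;
    assert(task_id < NUM_TASKS);
    while (queue_get(&task_queue[task_id], &e)) {
        if (e.type == EVT_CHECKIN_REQUEST) {
            event_t reply = { EVT_CHECKIN_REPLY, task_id, e.round };
            (void)queue_put(&kicker_queue, reply);
        }
    }
}

/* Kicker: collect replies; kick only when every task answered this round. */
bool kicker_collect(void)
{
    event_t e;
    while (queue_get(&kicker_queue, &e)) {
        if (e.type == EVT_CHECKIN_REPLY && e.round == current_round)
            replies_seen |= 1u << e.task_id;
    }
    if (replies_seen != (1u << NUM_TASKS) - 1u) return false;
    replies_seen = 0;
    wdg_kick_count++;   /* a real system would kick the WDG here */
    return true;
}
```

A nice side effect of this shape is that the check-in request travels through the same queue as the task's ordinary work, so a task that has stopped servicing its queue also stops replying.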

There are 100 ways to skin this cat, but the basic principle of a single task that is responsible for kicking the WDG and having other tasks funnel up to the kicker task is pretty standard.

There is at least one other aspect to consider - outside the scope of this question - and that's dealing with interrupts. The method I described above will trigger a WDG reset if an ISR is hogging the CPU (good), but what about the opposite scenario, where an ISR has (sadly) become inadvertently disabled? In many scenarios, this will not be caught: your system will still kick the WDG, yet part of your system is crippled. Fun stuff, that's why I love embedded development.
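One possible way to catch the disabled-ISR case, offered only as a sketch and not as something the answer above prescribes, is to have each periodic ISR bump a heartbeat counter that the kicker task inspects before kicking. This assumes the monitored ISRs are expected to fire at least once per kicker period, and that a 32-bit read is atomic on the target (on 8- or 16-bit MCUs the read would need a critical section). All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_MONITORED_ISRS 2u   /* hypothetical: e.g. a timer ISR and a UART ISR */

static volatile uint32_t isr_heartbeat[NUM_MONITORED_ISRS];
static uint32_t          last_seen[NUM_MONITORED_ISRS];

/* Called at the top of each monitored ISR with that ISR's slot index. */
void isr_heartbeat_tick(unsigned isr_id)
{
    isr_heartbeat[isr_id]++;
}

/* Kicker side: has this ISR run since the previous call? */
bool isr_made_progress(unsigned isr_id)
{
    uint32_t now;
    bool     progressed;

    assert(isr_id < NUM_MONITORED_ISRS);
    now = isr_heartbeat[isr_id];
    progressed = (now != last_seen[isr_id]);
    last_seen[isr_id] = now;
    return progressed;
}

/* Kicker side: call once per check period; kick the WDG only if this
 * returns true (in addition to the task check-ins). */
bool all_isrs_alive(void)
{
    bool alive = true;
    for (unsigned i = 0; i < NUM_MONITORED_ISRS; i++) {
        /* Check every ISR, even after a failure, so all snapshots update. */
        if (!isr_made_progress(i)) alive = false;
    }
    return alive;
}
```

This only works for interrupts with a known minimum rate; a sporadic interrupt (say, a button press) legitimately may not fire during a period and would need a different check.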