Watchdog
1.概述
Watchdog的中文的“看门狗”,有保护的意思。最早引入Watchdog是在单片机系统中,Linux也引入了Watchdog,在Linux内核下,当Watchdog启动后,便设定了一个定时器,如果在超时时间内没有对/dev/Watchdog进行写操作,则会导致系统重启。通过定时器实现的Watchdog属于软件层面。
Android设计了一个软件层面Watchdog,用于保护一些重要的系统服务,当出现故障时,通常会让Android系统重启。由于这种机制的存在,就经常会出现一些system_server进程被Watchdog杀掉而发生手机重启的问题。
2.Watchdog 的启动
Android的Watchdog是一个单例线程,在System Server时就会初始化Watchdog。Watchdog在初始化时,会构建很多HandlerChecker,大致可以分为两类:
- Monitor Checker,用于检查是Monitor对象可能发生的死锁, AMS, PKMS, WMS等核心的系统服务都是Monitor对象。
- Looper Checker,用于检查线程的消息队列是否长时间处于工作状态。Watchdog自身的消息队列,Ui, Io, Display这些全局的消息队列都是被检查的对象。此外,一些重要的线程的消息队列,也会加入到Looper Checker中,譬如AMS, PKMS,这些是在对应的对象初始化时加入的。
private void startBootstrapServices(@NonNull TimingsTraceAndSlog t) {
......
// Start the watchdog as early as possible so we can crash the system server// if we deadlock during early boott.traceBegin("StartWatchdog");
final Watchdog watchdog = Watchdog.getInstance(); // 对象创建
watchdog.start(); // 调用start方法初始化
mDumper.addDumpable(watchdog);
......
}
Watchdog.java,创建对象,并启动内部线程,执行run方法。
private Watchdog() {
mThread = new Thread(this::run, "watchdog");
// Initialize handler checkers for each common thread we want to check. Note
// that we are not currently checking the background thread, since it can
// potentially hold longer running operations with no guarantees about the timeliness
// of operations there.
//
// Use a custom thread to check monitors to avoid lock contention from impacted other
// threads.
// 创建HandlerThread监控线程并启动,线程名称为"watchdog.monitor"
ServiceThread t = new ServiceThread("watchdog.monitor",
android.os.Process.THREAD_PRIORITY_DEFAULT, true /*allowIo*/);
t.start();
// 封装到HandlerChecker,加入mHandlerCheckers列表
mMonitorChecker = new HandlerChecker(new Handler(t.getLooper()), "monitor thread");
mHandlerCheckers.add(withDefaultTimeout(mMonitorChecker));
// 将foreground thread、main thread、ui thread等加入mHandlerCheckers列表
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(FgThread.getHandler(), "foreground thread")));
// Add checker for main thread. We only do a quick check since there
// can be UI running on the thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(new Handler(Looper.getMainLooper()), "main thread")));
// Add checker for shared UI thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(UiThread.getHandler(), "ui thread")));
// And also check IO thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(IoThread.getHandler(), "i/o thread")));
// And the display thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(DisplayThread.getHandler(), "display thread")));
// And the animation thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(AnimationThread.getHandler(), "animation thread")));
// And the surface animation thread.
mHandlerCheckers.add(withDefaultTimeout(
new HandlerChecker(SurfaceAnimationThread.getHandler(),
"surface animation thread")));
// Initialize monitor for Binder threads.
addMonitor(new BinderThreadMonitor());
mInterestingJavaPids.add(Process.myPid());
// See the notes on DEFAULT_TIMEOUT.
assert DB || DEFAULT_TIMEOUT > ZygoteConnectionConstants.WRAPPED_PID_TIMEOUT_MILLIS;
mTraceErrorLogger = new TraceErrorLogger();
}
public void start() {
mThread.start();
}
3.加入Watchdog监控方式
有两种方式加入Watchdog监控:
- addThread():用于监测Handler线程,默认超时时长为60s.这种超时往往是所对应的handler线程消息处理得慢;
- addMonitor(): 用于监控实现了Watchdog.Monitor接口的服务.这种超时可能是”android.fg”线程消息处理得慢,也可能是monitor迟迟拿不到锁;
// PowerManagerService.java
@Override
public void onStart() {
// 加入watchdog监听
Watchdog.getInstance().addMonitor(this); // pms服务实现了Watchdog.Monitor接口
Watchdog.getInstance().addThread(mHandler);
}
// 在watch dog中调用,用于检测PMS服务中是否存在死锁
@Override // Watchdog.Monitor implementation
public void monitor() {
// Grab and release lock for watchdog monitor to detect deadlocks.
synchronized (mLock) {
}
}
// monitor对象加入mMonitorQueue,mMonitorQueue加入到mMonitors
public void addMonitor(Monitor monitor) {
synchronized (mLock) {
mMonitorChecker.addMonitorLocked(monitor);
}
}
void addMonitorLocked(Monitor monitor) {
// We don't want to update mMonitors when the Handler is in the middle of checking
// all monitors. We will update mMonitors on the next schedule if it is safe
mMonitorQueue.add(monitor);
}
// handler对象加入到mHandlerCheckers列表
public void addThread(Handler thread) {
synchronized (mLock) {
final String name = thread.getLooper().getThread().getName();
mHandlerCheckers.add(withDefaultTimeout(new HandlerChecker(thread, name)));
}
}
public final class HandlerChecker implements Runnable {
......
public void scheduleCheckLocked(long handlerCheckerTimeoutMillis) {
mWaitMaxMillis = handlerCheckerTimeoutMillis;
if (mCompleted) {
// Safe to update monitors in queue, Handler is not in the middle of work
mMonitors.addAll(mMonitorQueue);
mMonitorQueue.clear();
}
if ((mMonitors.size() == 0 && mHandler.getLooper().getQueue().isPolling())
|| (mPauseCount > 0)) {
// Don't schedule until after resume OR
// If the target looper has recently been polling, then
// there is no reason to enqueue our checker on it since that
// is as good as it not being deadlocked. This avoid having
// to do a context switch to check the thread. Note that we
// only do this if we have no monitors since those would need to
// be executed at this point.
mCompleted = true;
return;
}
if (!mCompleted) {
// we already have a check in flight, so no need
return;
}
mCompleted = false;
mCurrentMonitor = null;
mStartTimeMillis = SystemClock.uptimeMillis();
mHandler.postAtFrontOfQueue(this);
}
......
}
4.Watchdog的监测机制
Watchdog本身是一个线程,它的run()方法实现如下:
@Override
public void run() {
boolean waitedHalf = false;
while (true) {
...
synchronized (this) {
...
// 1. 调度所有的HandlerChecker
for (int i=0; i<mHandlerCheckers.size(); i++) {
HandlerChecker hc = mHandlerCheckers.get(i);
hc.scheduleCheckLocked();
}
...
// 2. 开始定期检查
long start = SystemClock.uptimeMillis();
while (timeout > 0) {
...
try {
wait(timeout);
} catch (InterruptedException e) {
Log.wtf(TAG, e);
}
...
timeout = CHECK_INTERVAL - (SystemClock.uptimeMillis() - start);
}
// 3. 检查HandlerChecker的完成状态
final int waitState = evaluateCheckerCompletionLocked();
if (waitState == COMPLETED) {
...
continue;
} else if (waitState == WAITING) {
...
continue;
} else if (waitState == WAITED_HALF) {
...
continue;
}
// 4. 存在超时的HandlerChecker
blockedCheckers = getBlockedCheckersLocked();
subject = describeCheckersLocked(blockedCheckers);
allowRestart = mAllowRestart;
}
...
// 5. 保存日志,判断是否需要杀掉系统进程
Slog.w(TAG, "*** GOODBYE!");
Process.killProcess(Process.myPid());
System.exit(10);
} // end of while (true)
}
以上代码片段主要的运行逻辑如下: Watchdog运行后,便开始无限循环,依次调用每一个HandlerChecker的scheduleCheckLocked()方法 调度完HandlerChecker之后,便开始定期检查是否超时,每一次检查的间隔时间由CHECK_INTERVAL常量设定,为30秒 每一次检查都会调用evaluateCheckerCompletionLocked()方法来评估一下HandlerChecker的完成状态: COMPLETED表示已经完成 WAITING和WAITED_HALF表示还在等待,但未超时 OVERDUE表示已经超时。默认情况下,timeout是1分钟,但监测对象可以通过传参自行设定,譬如PKMS的Handler Checker的超时是10分钟 如果超时时间到了,还有HandlerChecker处于未完成的状态(OVERDUE),则通过getBlockedCheckersLocked()方法,获取阻塞的HandlerChecker,生成一些描述信息 保存日志,包括一些运行时的堆栈信息,这些日志是我们解决Watchdog问题的重要依据。如果判断需要杀掉system_server进程,则给当前进程(system_server)发送signal 9 只要Watchdog没有发现超时的任务,HandlerChecker就会被不停的调度,那HandlerChecker具体做一些什么检查呢?
public final class HandlerChecker implements Runnable {
public void scheduleCheckLocked() {
// Looper Checker中是不包含monitor对象的,判断消息队列是否处于空闲
if (mMonitors.size() == 0 && mHandler.getLooper().isIdling()) {
mCompleted = true;
return;
}
...
// 将Monitor Checker的对象置于消息队列之前,优先运行
mHandler.postAtFrontOfQueue(this);
}
@Override
public void run() {
// 依次调用Monitor对象的monitor()方法
for (int i = 0 ; i < size ; i++) {
synchronized (Watchdog.this) {
mCurrentMonitor = mMonitors.get(i);
}
mCurrentMonitor.monitor();
}
...
}
}
- 对于Looper Checker而言,会判断线程的消息队列是否处于空闲状态。如果被监测的消息队列一直闲不下来,则说明可能已经阻塞等待了很长时间
- 对于Monitor Checker而言,会调用实现类的monitor方法,譬如上文中提到的AMS.monitor()方法, 方法实现一般很简单,就是获取当前类的对象锁,如果当前对象锁已经被持有,则monitor()会一直处于wait状态,直到超时,这种情况下,很可能是线程发生了死锁。
5.总结
- Watchdog用HandlerChecker来监控消息队列是否发生阻塞,用MonitorChecker来监控系统核心服务是否发生长时间持锁。
- HandlerChecker通mHandler.getLooper().getQueue().isPolling()判断是否超时,BinderThreadMonitor主要是通过判断Binder线程是否超过了系统最大值来判断是否超时,其他MonitorChecker通过synchronized(this)`判断是否超时。
- 超时之后,系统会打印一系列的信息,包括当前进程以及核心native进程的Stacktrace,kernel线程Stacktrace,打印Kernel里面blocked的线程以及所有CPU的backtraces。
- 超时之后,Watchdog会杀掉自己,导致zygote重启。