[1] ZHENG Z, LI Y, LAN Z. Anomaly localization in large-scale clusters[C]//Proceedings of the 2007 IEEE International Conference on Cluster Computing. Piscataway, NJ:IEEE, 2007:322-330.
[2] AVIZIENIS A, LAPRIE J C, RANDELL B. Fundamental concepts of dependability[R]. Newcastle:LAAS-CNRS, 2001:4.
[3] JORDAAN J F, PATEROK M. Event correlation in heterogeneous networks using the OSI management framework[C]//Proceedings of the IFIP TC6/WG6.6 Third International Symposium on Integrated Network Management with Participation of the IEEE Communications Society CNOM and with Support from the Institute for Educational Services. Amsterdam:North-Holland Publishing Co., 1993:683-695.
[4] NATU M, SETHI A S. Active probing approach for fault localization in computer networks[C]//Proceedings of the 2006 4th IEEE/IFIP Workshop on End-to-End Monitoring Techniques and Services. Piscataway, NJ:IEEE, 2006:25-33.
[5] STEINDER M, SETHI A S. A survey of fault localization techniques in computer networks[J]. Science of Computer Programming, 2004, 53(2):165-194.
[6] KATKER S, PATEROK M. Fault isolation and event correlation for integrated fault management[C]//Proceedings of the 5th IFIP/IEEE International Symposium on Integrated Network Management. Berlin:Springer, 1997:583-596.
[7] CHENG L, QIU X, MENG L, et al. Efficient active probing for fault diagnosis in large scale and noisy networks[C]//Proceedings of the 29th IEEE International Conference on Computer Communications. Washington, DC:IEEE Computer Society, 2010:1-9.
[8] PATIL B M, PATHAK V K. Survey of probe set and probe station selection algorithms for fault detection and localization in computer networks[J]. IEEE Transactions on Networks and Communications, 2015, 3(4):57.
[9] MENG L M, HUANG T, CHENG L, et al. Probe station placement for multiple faults localization[J]. Journal of Beijing University of Posts and Telecommunications, 2009, 32(5):1-5. (in Chinese)
[10] HUKERIKAR S, DINIZ P C, LUCAS R F, et al. Opportunistic application-level fault detection through adaptive redundant multithreading[C]//Proceedings of the 2014 International Conference on High Performance Computing & Simulation. Piscataway, NJ:IEEE, 2014:243-250.
[11] GARDNER R D, HARLE D A. Network fault detection:a simplified approach to alarm correlation[C]//Proceedings of the 16th IEEE Global Telecommunications Conference. Washington, DC:IEEE Computer Society, 1997:44-51.
[12] SCHROEDER B, GIBSON G. A large-scale study of failures in high-performance computing systems[J]. IEEE Transactions on Dependable and Secure Computing, 2010, 7(4):337-350.
[13] JAKOBSON G, WEISSMAN M. Real-time telecommunication network management:extending event correlation with temporal constraints[C]//Proceedings of the Fourth International Symposium on Integrated Network Management IV. London:Chapman & Hall, 1995:290-301.
[14] LEMARINIER P, BOUTEILLER A, KRAWEZIK G, et al. Coordinated checkpoint versus message log for fault tolerant MPI[J]. International Journal of High Performance Computing and Networking, 2004, 2(2/3/4):146-155.
[15] SCHROEDER B, GIBSON G A. Understanding failures in petascale computers[C]//Proceedings of the 6th Scientific Discovery through Advanced Computing Conference. Bristol:IOP Publishing Ltd, 2007:2022-2032.
[16] WU L P, MENG D, LIANG Y, et al. LUNF-a cluster job schedule strategy using characterization of nodes' failure[J]. Journal of Computer Research and Development, 2005, 42(6):1000-1005. (in Chinese)
[17] BAILEY D, HARRIS T, SAPHIR W, et al. The NAS parallel benchmarks 2.0:NAS-95-020[R]. Washington:NASA Ames Research Center, 1995:12.