Open Access. BY-NC-ND 3.0 license. Published by De Gruyter Open Access, April 2, 2013

A job checkpointing system for computational grids

Mohammed Amoon
From the journal Open Computer Science

Abstract

Fault tolerance is an important property of computational grids because their resources are geographically distributed. Job checkpointing is one of the most commonly used techniques for providing fault tolerance in computational grids. The efficiency of checkpointing depends on the choice of the checkpoint interval: an inappropriate interval can delay job execution. In this paper, a fault-tolerant scheduling system based on the checkpointing technique is presented and evaluated. When scheduling a job, the system combines the average failure time and failure rate of grid resources with their response time to make scheduling decisions. It then uses the failure rate of the assigned resources to calculate the checkpoint interval for each job. Extensive simulation experiments are conducted to quantify the performance of the proposed system. The experiments show that the proposed system can considerably improve the throughput, turnaround time, grid load, and failure tendency of computational grids.
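The abstract does not state the formula the system uses to derive a checkpoint interval from a failure rate. As a rough illustration of the general idea only, the sketch below uses Young's well-known first-order approximation, in which the interval is the square root of twice the checkpoint cost divided by the failure rate. This is a stand-in technique, not the paper's method; the function name, parameter names, and the sample numbers are all assumptions made for the example.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float,
                              failure_rate_per_s: float) -> float:
    """Return a checkpoint interval in seconds.

    Uses Young's first-order approximation sqrt(2 * C / lambda), where C is
    the time needed to write one checkpoint and lambda is the failure rate
    of the assigned resource. This is an illustrative substitute, not the
    formula used by the system described in the abstract.
    """
    if checkpoint_cost_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("checkpoint cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)


# Hypothetical resources: one fails about once every 6 hours,
# the other about once every hour. Checkpoint cost is assumed to be 30 s.
reliable = young_checkpoint_interval(30.0, 1.0 / 21600.0)
unreliable = young_checkpoint_interval(30.0, 1.0 / 3600.0)

# A higher failure rate yields a shorter interval (more frequent checkpoints).
print(f"reliable resource:   checkpoint every {reliable:.0f} s")
print(f"unreliable resource: checkpoint every {unreliable:.0f} s")
```

The sketch reflects the trade-off the abstract points to: checkpointing too often wastes time on checkpoint overhead, while checkpointing too rarely loses more work on each failure, so the interval should shrink as the assigned resource becomes less reliable.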

Published Online: 2013-4-2
Published in Print: 2013-3-1

© 2013 Versita Warsaw

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.

DOI: 10.2478/s13537-013-0103-3