Abstract—Text-to-speech (TTS) is one of the key technologies of human–computer interaction. Current state-of-the-art HMM-based TTS can produce highly intelligible, natural-sounding output speech with decent segmental quality; however, its predicted durations tend to be unnatural. In this paper, state durations are generated by jointly maximizing the duration likelihoods of state, phone, and syllable units. By considering the durations of states and longer units jointly, the accumulation of errors in the estimated state durations is regulated during the optimization procedure. Experiments on Mandarin databases show that the optimized model yields more accurate duration predictions than the baseline state duration model, reducing phone-level RMSE by 2.45 ms. A perceptual test further confirms that the optimized duration model outperforms the baseline system.