Analysis of Synthetic Speech Coding
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 4, NO. 3, MAY 1996, p. 243
1063-6676/96$05.00 © 1996 IEEE

On Improving Performance of Analysis-by-Synthesis Speech Coders

S. Cucchi, M. Fratti, and M. Ronchi

Abstract: An algorithm for improving the performance of analysis-by-synthesis (A-by-S) speech coders is presented. It is shown that proper modifications of the A-by-S strategies commonly used to determine the excitation parameters can be devised. In particular, a redefinition of the classical objective function, based on the weighted mean squared error criterion, may allow better tracking of speech signal transients as well as better alignment of the synthetic pitch pulses. The implementation details of the algorithm are discussed and experimental results are presented.

I. INTRODUCTION

Analysis-by-synthesis (A-by-S) speech coders are widely used to encode speech at medium-low bit rates. In particular, linear-prediction-based (LPC-based) A-by-S coders have been studied and developed since Atal and Remde's work [1] on multipulse linear prediction (MP-LPC). An excellent survey of the most common A-by-S coding algorithms can be found in [2]. In principle, the main difference among the various LPC-based A-by-S coding techniques lies in the particular a priori assumptions made about the nature of the excitation signal that is fed into the all-pole LPC synthesis filter. Irrespective of the nature of the excitation signal, all A-by-S coders determine their parameters by minimizing a perceptually weighted mean squared error (WMSE) criterion.

While A-by-S coders provide good or excellent quality at medium bit rates, their performance may degrade rapidly as the bit rate is lowered. It has been recognized that an incorrect alignment of the pitch period markers [3] and a slow convergence in tracking the signal transients [4] are among the major causes of quality degradation in A-by-S coders. In this work we propose a method to alleviate these problems; in particular, the following modifications
to the aforementioned WMSE criterion can be introduced: 1) a look-ahead procedure, in order to allow a better convergence speed in signal transients, and 2) an objective function that takes into account the behavior of the ideal excitation, in order to obtain a better alignment of the pitch markers. Each of these modifications is discussed in the sections below.

Manuscript received March 3, 1995; revised November 21, 1995. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Spiros Dimolitsas. The authors are with Alcatel Telettra, Milan, Italy. Publisher Item Identifier S 1063-6676(96)04096-5.

II. A-by-S EXCITATION ISSUES

In usual A-by-S coders, the parameters of the LPC filter excitation waveform are determined according to an objective function that relies on the well-known WMSE criterion. In formulas,

\[ E = \sum_{n=0}^{N-1} \left[ r(n) - G\,u_i(n) \right]^2 \qquad (1) \]

where $N$ is the length of the A-by-S segment, $u_i(n)$ is the perceptually weighted LPC response to the generic $i$th excitation waveform (i.e., the $i$th codeword), $G$ is the codeword gain, and $r(n)$ is the reference (or objective) signal, i.e., the perceptually weighted speech signal segment after subtraction of the perceptually weighted LPC synthesis filter memory ringing. Note that no constraints are imposed on the nature of the excitation signal under consideration. In fact, any excitation signal can be considered a generic codeword of a codebook of arbitrary dimension. As an example, in early CELP coders [9] a codebook made up of distinct segments of Gaussian noise samples was employed; in MP-LPC coders, all the admissible combinations of pulse positions and related amplitudes form an unconstrained pulse-structured codebook.

The objective function defined in (1), although commonly used, may not be optimal with respect to the choice of the excitation parameters $G$ and $i$. In fact, since the synthesis model is causal, the excitation samples located near the frame beginning (i.e., for $n$ close to zero) will give a greater contribution in
the WMSE sense than the excitation samples located near the frame end (i.e., for $n$ close to $N-1$). This fact may lead to an inefficient approximation of the ideal excitation (i.e., the LPC residual) in case of sudden unvoiced-to-voiced transients and of steady-state voiced sounds. In both cases, a careful reconstruction of the ideal excitation is important, especially around the pitch period markers, where the typical pulselike characteristics must be preserved in both amplitude and position. If the pitch markers of the ideal excitation are located near the A-by-S frame end, a proper reconstruction may become difficult, since the contribution of the corresponding synthetic excitation samples to the reconstructed speech has a minor weight in a WMSE-based criterion.

In Fig. 1(a), the ideal excitation waveform in a 4.8 kb/s CELP coder is depicted (case of a male speaker with a fundamental frequency of about 100 Hz). In Fig. 1(b), the corresponding synthetic excitation obtained by using the objective function (1) is depicted. The CELP synthetic excitation is obtained with the contribution of an adaptive codebook and a fixed stochastic codebook. The A-by-S frame length is 7.5 ms. Note the poor reproduction of the pitch markers along the whole speech transient. This slow convergence to the typical pulselike characteristic of the excitation is the major source of reverberation and unnaturalness in reconstructed speech.

III. A LOOK-AHEAD PROCEDURE

In order to cope with the above problems, a first approach based on a waveform look-ahead procedure can be devised. The rationale behind it is now explained. In Fig. 3, an ideal LPC-based waveform synthesis scheme is depicted. The ideal excitation $e(n)$, together with the filter state, allows the waveform reconstruction on a frame basis, i.e., for $n = 0, \ldots, N-1$, $N$ being the frame length. The ideal excitation drives the LPC synthesis filter into a final state that will contribute to the waveform synthesis in the next frame. This contribution may be quantified by zeroing the
filter input in the next frame and measuring the waveform evolution of the zero-input filter for $n > N-1$. Depending on the dynamics of the last excitation samples and on the filter poles, this free evolution may have a significant dynamic range. In particular, if there is a significant energy concentration in the last excitation samples, the contribution of the speech waveform free evolution in the successive frame may play a major role.

[Fig. 1. (a) Ideal excitation in a 4.8 kb/s CELP coder; (b) corresponding synthetic excitation using the objective function (1); (c) improved synthetic excitation using the modified objective function (7) in conjunction with the look-ahead procedure ($M = 8$, $\alpha = 0.8$). Case of a male speaker with a fundamental frequency of about 100 Hz.]

Therefore, the usual A-by-S approach should be reconsidered in order to take this free evolution contribution into account as well. The A-by-S search based on the usual WMSE may not be optimal in this respect: due to the causality of the synthesis model, the last synthetic excitation samples will typically have a minor influence on the objective function. A simple way to solve this problem is to include the speech free evolution in the WMSE-based objective function. In formulas, the A-by-S criterion can be restated as follows:

\[ E = \sum_{n=0}^{N-1+M} \left[ \tilde{r}(n) - G\,\tilde{u}_i(n) \right]^2 \qquad (2) \]

We have defined

\[ \tilde{r}(n) = r(n), \quad n = 0, \ldots, N-1 \qquad (3) \]
\[ \tilde{r}(n) = v(n), \quad n = N, \ldots, N-1+M \qquad (4) \]

as well as

\[ \tilde{u}_i(n) = u_i(n), \quad n = 0, \ldots, N-1 \qquad (5) \]
\[ \tilde{u}_i(n) = v_i(n), \quad n = N, \ldots, N-1+M \qquad (6) \]

In (2), $r(n)$ is the reference signal as in (1) and $v(n)$ is the corresponding free evolution, which is appended to $r(n)$. Correspondingly, $u_i(n)$ is the perceptually weighted LPC filter response to the $i$th codeword as in (1), and $v_i(n)$ is its free
evolution. $M$ is the free evolution length. The excitation parameters $i$ and $G$ are chosen in such a way that the modified WMSE-based objective function (2) is minimum.

In principle, the following steps should be performed in order to compute the free evolution $v(n)$ [respectively $v_i(n)$]:

1) Inverse filtering, by means of the all-zero LPC filter, of $r(n)$ [$u_i(n)$] along the A-by-S frame, i.e., for $n = 0, \ldots, N-1$. Obtain the prediction residual $e(n)$ [$e_i(n)$].
2) All-pole filtering of the prediction residual along the A-by-S frame. Reobtain $r(n)$ [$u_i(n)$] and drive the all-pole filter into a final state.
3) Zero-input all-pole filtering starting from the final state obtained in step 2. The filter ringing thus obtained is just $v(n)$ [$v_i(n)$].

An examination of the above computation steps reveals that there is no need to compute the prediction residual. It is sufficient to preload the filter state with the last $p$ samples ($p$ being the filter order) of $r(n)$ [$u_i(n)$] and to evaluate the zero-input filter response.

The described look-ahead procedure allows us to take into account, in the WMSE criterion, the contribution that the synthesis due to the $i$th codeword would have in the successive A-by-S frame. This can be carried out without the need to consider speech samples beyond the A-by-S frame; therefore, the algorithmic delay of the encoding process is not increased.

Finally, note that for the $v(n)$ and $v_i(n)$ calculation, the same set of LPC coefficients used in the $r(n)$ and $u_i(n)$ calculation is employed. Consider an A-by-S speech coder that has one frame for the LPC parameters and several subframes per frame for the excitation. The free evolution is computed correctly for all subframes except the last one. The free evolution of the excitation calculated in the last subframe will occur in the first subframe of the next frame and therefore should be computed, in principle, with another set of LPC coefficients. However, in order not to increase the system algorithmic delay, this look-ahead information must be computed with the previous LPC parameter set. This may
not be the optimal choice, but it has been observed not to have a very significant impact on performance.

IV. A MODIFIED OBJECTIVE FUNCTION

In the preceding paragraphs we pointed out that, in order to obtain natural-sounding synthetic speech, it is important that the reconstructed excitation be similar to the ideal one, especially with respect to the temporal localization of the pitch markers. From this observation it follows that it may be desirable to obtain a good similarity between the LPC ideal excitation and the synthetic one.

[Fig. 2. Same as Fig. 1, but with a female speaker with a fundamental frequency of about 210 Hz.]

[Fig. 3. General LPC-based waveform synthesis scheme. Block labels: ideal excitation, speech, free evolution.]

By using the usual WMSE-based objective function, the parameters of the synthetic excitation yield a reconstructed speech that is similar, on the average, to the original one. Actually, from a perceptual point of view, it is often more important to obtain a local and very close similarity. As an example, the reconstruction of an unvoiced-to-voiced attack with the correct temporal alignment, duration, and envelope is important for maintaining good quality, and it is not uncommon to find attack transients whose temporal duration is much shorter than the length of the A-by-S frame. A close similarity between the ideal excitation and the synthetic one may help in obtaining a good local speech reconstruction. Therefore, the objective function can be modified in order to take into account also the contribution of the WMSE with respect to the ideal excitation. In particular, the objective function can be thought of as composed of two contributions, with respect to the reference signal and to the ideal
excitation, respectively. In formulas, we get

\[ E = \alpha \sum_{n=0}^{N-1} \left[ r(n) - G\,u_i(n) \right]^2 + (1-\alpha) \sum_{n=0}^{N-1} \left[ e(n) - G\,e_i(n) \right]^2 \qquad (7) \]

In (7), $e(n)$ is the prediction residual obtained by inverse filtering of the reference signal $r(n)$; $e_i(n)$ is the $i$th codebook excitation that generates the synthetic signal $u_i(n)$; $\alpha$ is a parameter that controls the balance between the WMSE with respect to the reference signal and the WMSE with respect to the ideal excitation ($0 \le \alpha \le 1$). The synthetic excitation parameters, i.e., $i$ and $G$, are chosen in such a way that the objective function (7) reaches its minimum. By zeroing the derivative of (7) with respect to $G$, we obtain

\[ G_{opt} = \frac{\alpha \sum_{n=0}^{N-1} r(n)\,u_i(n) + (1-\alpha) \sum_{n=0}^{N-1} e(n)\,e_i(n)}{\alpha \sum_{n=0}^{N-1} u_i^2(n) + (1-\alpha) \sum_{n=0}^{N-1} e_i^2(n)} \qquad (8) \]

The excitation index $i$ is selected in such a way that

\[ \frac{\left[ \alpha \sum_{n=0}^{N-1} r(n)\,u_i(n) + (1-\alpha) \sum_{n=0}^{N-1} e(n)\,e_i(n) \right]^2}{\alpha \sum_{n=0}^{N-1} u_i^2(n) + (1-\alpha) \sum_{n=0}^{N-1} e_i^2(n)} \]

is maximized; this is the term that substituting $G_{opt}$ into (7) subtracts from the codeword-independent energy $\alpha \sum r^2(n) + (1-\alpha) \sum e^2(n)$. It is obvious that by imposing $\alpha = 1$ we reobtain the usual WMSE-based objective function (1) and the corresponding expression for $G_{opt}$.

It is worth noting that the look-ahead procedure described in the preceding section can be used in combination with the above modified objective function. It is sufficient to replace the terms $r(n)$ and $u_i(n)$ of (7) with the extended terms $\tilde{r}(n)$ and $\tilde{u}_i(n)$ of (2), i.e., the same signals with their free evolutions appended. Indeed, as will be shown in the next section, we found that the most significant improvements to the synthetic speech quality were obtained by using the two procedures in conjunction, with a proper tailoring of the parameters $M$ (i.e., the free evolution length; see (2)) and $\alpha$. The general synthesis scheme of an A-by-S LPC-based speech coder relying on the described algorithmic modification is depicted in Fig. 4.

[Fig. 4. Block label recovered from the figure: compute and append free evolution.]

V. SIMULATION RESULTS

The validity of the proposed approaches was tested in two configurations:

1) A 4.8 kb/s CELP coder with forward predictor adaptation, a buffering size of 30 ms, and four subframes of 7.5 ms. The excitation consisted of two contributions, from an adaptive codebook and a fixed stochastic codebook, respectively.
2) A low-delay 8 kb/s CELP coder with backward predictor adaptation and a buffering size of 2.5 ms. The excitation consisted of a single contribution from a sparse stochastic codebook.

At the beginning, the look-ahead procedure and the modified objective function approach were tested separately, by careful tailoring of the free evolution length $M$ and the balancing factor $\alpha$. However, it was found that using the two approaches in conjunction provided the best improvement in the synthetic speech quality over a wide range of speakers and languages. For each coder configuration, the values of the $M$ and $\alpha$ parameters were optimized carefully by means of subjective listening tests; in particular, the following values were found: 1) for the 4.8 kb/s CELP, $M = 8$ and $\alpha = 0.8$ for both the adaptive and the stochastic excitation, and 2) for the 8 kb/s low-delay CELP, $M = 4$ and $\alpha = 0.7$.

Informal listening tests revealed a noticeable improvement in the synthetic speech quality. In particular, reverberation effects were reduced in low-pitched male speakers, and a more natural voice was obtained in high-pitched female speakers. In fact, the two approaches used in conjunction allow a better tracking of the speech signal transients and a more accurate reconstruction of the ideal excitation. In Fig. 1(c), a segment of the improved synthetic excitation is depicted for the case of the 4.8 kb/s CELP coder. By comparing the plots in Fig. 1(b) and (c) with the plot in Fig. 1(a) (ideal excitation), it can be seen that the periodic characteristic of the excitation and the waveform dynamics, especially around the pitch period markers, are reconstructed more accurately. Similar considerations can be made by examining Fig. 2 (case of a female speaker with a fundamental frequency of about 210 Hz in steady-state voiced conditions): although the classical synthetic excitation, Fig. 2(b), shows a good degree of periodicity when compared to the ideal excitation, Fig. 2(a), the improved synthetic excitation, Fig. 2(c), reveals a more precise tracking of the pulselike characteristics.

The balance factor $\alpha$ in the modified objective function (7) could be
made time varying, i.e., adaptive. Some experiments were conducted in this sense, but no further improvements were obtained; nevertheless, we believe that an adaptation of the $\alpha$ factor as a function of certain characteristics of the speech signal is likely to improve the performance further.
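
As an illustration of the two modifications taken together, the following minimal Python sketch computes the free evolution by preloading the all-pole filter state (Section III) and runs an exhaustive codebook search with the gain of (8) (Section IV). It is not the authors' implementation. All function names are hypothetical, and the filter sign convention $s(n) = x(n) + \sum_{k=1}^{p} a_k s(n-k)$ is an assumption, as is the use of a single coefficient set `a` standing in for the perceptually weighted synthesis filter.

```python
def dot(x, y):
    """Inner product of two equal-length sample sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))


def free_evolution(signal, a, M):
    """Zero-input response (M samples) of the all-pole filter 1/A(z).

    Assumed convention: s(n) = x(n) + sum_{k=1..p} a[k-1] * s(n-k).
    The state is preloaded with the last p samples of `signal`,
    so no prediction residual has to be computed.
    """
    p = len(a)
    tail = list(signal[-p:]) if p else []
    # Most recent sample first; pad with zeros if fewer than p samples exist.
    state = tail[::-1] + [0.0] * (p - len(tail))
    out = []
    for _ in range(M):
        y = sum(ak * sk for ak, sk in zip(a, state))  # filter input is zero
        out.append(y)
        state = [y] + state[:-1]
    return out


def append_free_evolution(signal, a, M):
    """Extended signal of (3)-(6): the frame followed by its free evolution."""
    return list(signal) + free_evolution(signal, a, M)


def objective(r, u, e, ex, gain, alpha):
    """Modified objective (7) for one codeword and a given gain."""
    err_sig = sum((ri - gain * ui) ** 2 for ri, ui in zip(r, u))
    err_exc = sum((ei - gain * xi) ** 2 for ei, xi in zip(e, ex))
    return alpha * err_sig + (1.0 - alpha) * err_exc


def search_codebook(r, e, responses, codewords, alpha):
    """Exhaustive search: maximize num^2/den, return (index, G_opt) as in (8)."""
    best = None
    for i, (u, ex) in enumerate(zip(responses, codewords)):
        num = alpha * dot(r, u) + (1.0 - alpha) * dot(e, ex)
        den = alpha * dot(u, u) + (1.0 - alpha) * dot(ex, ex)
        if den <= 0.0:
            continue  # all-zero codeword: gain undefined, skip it
        score = num * num / den
        if best is None or score > best[0]:
            best = (score, i, num / den)
    if best is None:
        raise ValueError("codebook contains no usable codeword")
    return best[1], best[2]


def search_with_lookahead(r, e, responses, codewords, a, M, alpha):
    """Same search, with r(n) and u_i(n) replaced by their extended versions."""
    r_ext = append_free_evolution(r, a, M)
    u_ext = [append_free_evolution(u, a, M) for u in responses]
    return search_codebook(r_ext, e, u_ext, codewords, alpha)
```

In this sketch, `responses[i]` plays the role of $u_i(n)$ and `codewords[i]` that of $e_i(n)$. Calling `search_with_lookahead(r, e, responses, codewords, a, M=8, alpha=0.8)` would correspond to the parameter values reported above for the 4.8 kb/s coder, and `alpha=1.0` with `M=0` falls back to the classical criterion (1).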