CE 256: STOCHASTIC HYDROLOGY SHORT PROJECT
DATE OF SUBMISSION: 04.12.2012
BIHU SUCHETANA, 2ND YEAR M.E., WATER RESOURCES AND ENVIRONMENTAL ENGINEERING
SR No.- 08308

TABLE OF CONTENTS
I. INTRODUCTION
II. TIME SERIES PLOT
III. MOMENTS OF THE DATA
IV. DISTRIBUTION OF THE DATA
V. AUTO-COVARIANCE AND AUTO-CORRELATION
VI. PARTIAL AUTOCORRELATION
VII. LINE SPECTRUM
VIII. POWER SPECTRUM
IX. DIFFERENCED DATA
X. GENERATION OF MONTHLY DATA
XI. STANDARDIZED SERIES
XII. ARMA MODELS
XIII. SIMPLE LINEAR REGRESSION

INTRODUCTION:

Any engineering design with hydrological inputs involves some uncertainty. The analysis of these inherent uncertainties is the main scope of "Stochastic Hydrology". Hydrologists have always striven for better predictive capability for stochastic processes, and the key to more accurate prediction lies in historical data: history provides a valuable clue to the future. Unless drastic changes occur in the catchment (such as rapid urbanization or large-scale deforestation), the statistical relations tend to persist. Thus, using the available data for a catchment, we generate new data that is not exactly the same as the historical record but has the same statistical properties. It is this quality of "persistence" that enables better prediction and forecasting. Various tools can be used to analyze historical data and capture the information it conveys; in this project I have attempted to apply some of the tools studied in the course. The region chosen for analysis is Bardhaman district in West Bengal, India. Monthly rainfall values were taken for a period of 102 years (1901-2002) from www.indiawaterportal.org/met_data/.
The rainfall data is enclosed in the Excel file named "RainfallWB"; there are a total of 1224 data points.

TIME SERIES PLOT:

A time series plots the values taken by a random variable over time. The random variable here is precipitation; since Bardhaman lies in the tropics, precipitation means rainfall. The time series plot for rainfall is shown below.

Figure 1: Time series plot of the monthly rainfall at Bardhaman, 1901-2002

MOMENTS OF THE DATA:

The moments of the data provide valuable clues about its statistical properties; the sample moments serve as estimates of the population characteristics. In hydrology the first four moments have special significance: the first moment (mean) measures central tendency, the second (variance, reported here as the standard deviation) measures dispersion, and the third and fourth (skewness and kurtosis) measure asymmetry and peakedness respectively. The first four moments calculated from the sample data are:

Table 1: First four moments of the original series
  1st  MEAN                 118.090403594771 mm
  2nd  STANDARD DEVIATION   132.158107035107 mm  (coefficient of variation: 1.119)
  3rd  SKEWNESS             1.13889350294147  (skewed to the right)
  4th  KURTOSIS             3.42265003752722  (leptokurtic)

The relations used are:

Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Variance: $S^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$

Standard deviation: $S = \sqrt{S^2}$

Coefficient of variation: $C_v = S/\bar{x}$

Coefficient of skew: $C_s = \frac{n\sum_{i=1}^{n}(x_i-\bar{x})^3}{(n-1)(n-2)\,S^3}$

Kurtosis coefficient: $K = \frac{n^2\sum_{i=1}^{n}(x_i-\bar{x})^4}{(n-1)(n-2)(n-3)\,S^4}$

DISTRIBUTION OF THE DATA:

The pdf of the data is plotted and takes the shape shown below. The kurtosis, being close to 3, together with the shape of the pdf, suggests a normal distribution; however, negative values are absent here, since rainfall cannot be negative. So, we check whether the data follows a lognormal distribution. By performing the Kolmogorov-Smirnov test (K.S.
test) using the available MATLAB function, it is verified that the data follows a log-normal distribution, with a (log-space) mean of 4.77 and standard deviation of 4.884. Similarly, the pdf of the annual average rainfall values appears to follow a normal distribution, which is confirmed by the K.S. test; the annual average rainfall values have a mean of 118.09 mm and a standard deviation of 18.094 mm.

Figure 2: PDF of the rainfall data

Figure 3: PDF of the annual average rainfall data, following a normal distribution

AUTO-COVARIANCE AND AUTO-CORRELATION:

Auto-covariance measures the covariance between elements of a series and elements of the same series separated by a lag k. The auto-covariance and auto-correlation are computed for the rainfall data; the resulting matrices are saved in the folder named "Autocov and Autocorrel". We then plot the auto-correlogram and partial auto-correlogram for the data, considering lags up to 0.25 times the number of data points, i.e., up to lag 306. The correlogram indicates the memory of the process, i.e., how far into the past the process can "remember".

Figure 4: Auto-correlogram up to lag 306 (significance bands at 95%)

PARTIAL AUTOCORRELATION:

The partial autocorrelation at lag k indicates the explanatory power of the lag-k term in a regression once the dependence on all intermediate lags has been removed, or "partialled out". Plotting the partial autocorrelations against lag gives the partial auto-correlogram, shown below.

Figure 5: Partial auto-correlogram up to lag 306 (significance bands at 95%)

The slowly decaying, oscillatory nature of the correlogram indicates possible periodicity in the data. To capture this periodicity, we obtain the spectral densities in the frequency domain; the line spectrum and power spectrum are used to identify these periodicities.
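As an illustration of the auto-covariance and auto-correlation computations described above, here is a minimal sketch in Python. It uses a toy sinusoidal series standing in for the monthly rainfall (the project itself was done in MATLAB), with the biased sample autocovariance estimator:

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocovariance C_k and autocorrelation r_k = C_k / C_0."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xm = x.mean()
    # C_k = (1/n) * sum_{t=1}^{n-k} (x_t - xbar)(x_{t+k} - xbar)
    acov = np.array([np.sum((x[:n - k] - xm) * (x[k:] - xm)) / n
                     for k in range(max_lag + 1)])
    return acov, acov / acov[0]

# Toy series with a 12-month cycle, standing in for the rainfall data
t = np.arange(240)
x = 100 + 50 * np.sin(2 * np.pi * t / 12)
acov, acf = autocorrelation(x, 24)

# 95% significance band for a white-noise null: +/- 1.96/sqrt(n)
band = 1.96 / np.sqrt(len(x))
```

With a 12-month periodicity, the correlogram oscillates: strongly positive near lag 12 and strongly negative near lag 6, mirroring the decaying sinusoidal pattern seen in Figure 4.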
LINE SPECTRUM:

The line spectrum gives the amount of variance per unit frequency. The relations used for its calculation are:

$f_k = k/N$, with frequency $\omega_k = 2\pi k/N$

$\alpha_k = \frac{2}{N}\sum_{t=1}^{N} X_t \cos(2\pi f_k t)$

$\beta_k = \frac{2}{N}\sum_{t=1}^{N} X_t \sin(2\pi f_k t)$

$I_k = \frac{N}{2}\left(\alpha_k^2 + \beta_k^2\right)$, for $k = 1, 2, 3, \ldots, 0.25N$

The line spectrum of the given data is shown below.

Figure 6: Line spectrum of the original series

From the line spectrum we notice two significant periodicities, at ω = 0.5236 and ω = 1.0472. The period P is given by $P = 2\pi/\omega$, so the data has periodicities of 12 months and 6 months respectively.

Statistical test for significance of periodicities:

A statistic is defined as follows (Kashyap and Rao, 1976):

$\eta = \frac{\gamma^2 (N-2)}{4\rho}$

where $\gamma^2 = \alpha_k^2 + \beta_k^2$, with $\alpha_k$ and $\beta_k$ as defined above,

$\rho = \frac{1}{N}\sum_{t=1}^{N}\left[X_t - \alpha_k \cos(\omega_k t) - \beta_k \sin(\omega_k t)\right]^2$

and N is the total number of data points (1224 in this case). For testing the periodicity associated with a particular ω_k, η is compared with F(2, N-2); F(2, N-2) ≈ 3 for N > 120 at 95% confidence.

Table 2: Test for significance of periodicities. Both periodicities are found to be significant at 95% confidence.

POWER SPECTRUM:

The line spectrum is a statistically inconsistent estimate. To get a statistically consistent estimate, we plot the power spectrum. The Tukey window is used for the lag window λ_j, with a maximum lag of 0.25 times the length of the data. The resulting plot is clearly smoother than the line spectrum. The relations used are:

$f_k = k/N$, with frequency $\omega_k = 2\pi k/N$

$\alpha_k = \frac{2}{N}\sum_{t=1}^{N} x_t \cos(2\pi f_k t)$

$\beta_k = \frac{2}{N}\sum_{t=1}^{N} x_t \sin(2\pi f_k t)$

Power spectral density: $I_k = 2\left[C_0 + 2\sum_{j=1}^{M-1} C_j \lambda_j \cos(2\pi f_k j)\right]$

Tukey window: $\lambda_j = \frac{1}{2}\left(1 + \cos\frac{\pi j}{M}\right)$

where C_j is the auto-covariance at lag j, C_0 is the variance, M = 0.25N is the maximum lag, and N is the length of the data.

Figure 7: Power spectrum of the original series, exhibiting a smoothed appearance

DIFFERENCED DATA:

A differenced series removes non-stationarity in the data.
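The line-spectrum relations of the previous section can be sketched in a few lines. This is only an illustrative re-implementation on a toy sinusoidal series, not the project's MATLAB computation on the Bardhaman data:

```python
import numpy as np

def line_spectrum(x, kmax):
    """Periodogram ordinates I_k = (N/2)(alpha_k^2 + beta_k^2),
    following the formulas in the text, for k = 1..kmax."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, n + 1)
    I = []
    for k in range(1, kmax + 1):
        f = k / n
        a = (2.0 / n) * np.sum(x * np.cos(2 * np.pi * f * t))
        b = (2.0 / n) * np.sum(x * np.sin(2 * np.pi * f * t))
        I.append((n / 2.0) * (a * a + b * b))
    return np.array(I)

# A 12-month cycle in N = 240 points should peak at k = N/12 = 20,
# i.e. at omega = 2*pi*k/N = 2*pi/12 = 0.5236
n = 240
t = np.arange(1, n + 1)
x = 100 + 50 * np.sin(2 * np.pi * t / 12)
I = line_spectrum(x, n // 2)
peak_k = int(np.argmax(I)) + 1
omega = 2 * np.pi * peak_k / n
```

The peak frequency recovered this way matches the ω = 0.5236 (12-month) periodicity reported for the rainfall series.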
Using first-order differencing, we construct a new series where

$Y_t = X_t - X_{t-1}$

with Y_t the t-th term of the differenced series and X_t, X_{t-1} the t-th and (t-1)-th terms of the original series. From the differenced series we obtain the auto-correlogram, partial auto-correlogram, line spectrum and power spectrum. Both the line and power spectra of the differenced series are seen to be similar to those of the original data.

Figure 8: Auto-correlogram of the differenced series up to lag 305

Figure 9: Partial auto-correlogram of the differenced series up to lag 305

Figure 10: Line spectrum of the differenced series

Figure 11: Power spectrum of the differenced series

GENERATION OF MONTHLY DATA:

Monthly data for 50 years is generated using the non-stationary first-order Markov model, also known as the non-stationary Thomas-Fiering model. The basic equation is:

$X_{i,j+1} = \mu_{j+1} + \rho_j\,\frac{\sigma_{j+1}}{\sigma_j}\left(X_{i,j} - \mu_j\right) + t_{i,j+1}\,\sigma_{j+1}\sqrt{1 - \rho_j^2}$

where i denotes the year (1 to 50); j denotes the month (1 to 12); X_{i,j} is the rainfall in the j-th month of the i-th year; μ_j is the mean rainfall of the j-th month; σ_j is the standard deviation of rainfall for the j-th month; ρ_j is the lag-1 correlation between the j-th and (j+1)-th months; and t_{i,j+1} is a standard normal deviate.

Table 3: Correlation of the data of each month with the next month

The generated rainfall values for the next 50 years are enclosed in Sheet 1 of the MS Excel file "Generated Values".

Table 4: Comparison of the first four moments of the generated and original data
  1st  MEAN                 114.1 mm (generated)           118.090403594771 mm (original)
  2nd  STANDARD DEVIATION   127.1 mm                       132.158107035107 mm
  3rd  SKEWNESS             1.0022 (skewed to right)       1.13889350294147 (skewed to right)
  4th  KURTOSIS             2.8961 (platykurtic)           3.42265003752722 (leptokurtic)

The moments calculated from the generated data and the original data are approximately equal to each other.
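The Thomas-Fiering recursion above can be sketched as follows. The monthly means, standard deviations and lag-1 correlations used here are illustrative placeholders, not the Bardhaman statistics of Table 3:

```python
import numpy as np

rng = np.random.default_rng(42)

def thomas_fiering(mu, sigma, rho, n_years, x0):
    """Non-stationary first-order Markov (Thomas-Fiering) generation.
    mu, sigma: length-12 monthly means / std devs; rho[j]: lag-1
    correlation between month j and month j+1 (wrapping Dec -> Jan)."""
    series = []
    x_prev = x0
    j = 11  # previous month is December, so the first generated month is January
    for _ in range(12 * n_years):
        jn = (j + 1) % 12
        t = rng.standard_normal()  # standard normal deviate t_{i,j+1}
        x = (mu[jn]
             + rho[j] * sigma[jn] / sigma[j] * (x_prev - mu[j])
             + t * sigma[jn] * np.sqrt(1.0 - rho[j] ** 2))
        series.append(x)
        x_prev, j = x, jn
    return np.array(series)

mu = 100 + 80 * np.sin(2 * np.pi * np.arange(12) / 12)  # toy monthly means
sigma = np.full(12, 20.0)                               # toy monthly std devs
rho = np.full(12, 0.3)                                  # toy lag-1 correlations
gen = thomas_fiering(mu, sigma, rho, n_years=50, x0=mu[11])
```

Because the deviates are normal, the recursion can (and, for low-rainfall months, will) produce negative values, which is why the report handles negatives separately before using the series for operation studies.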
This shows that the statistical properties of the original data have been retained during generation. Some of the generated values are negative, which is not physically feasible; hence, while using these generated values for reservoir operation, the negative values must be eliminated. They must, however, be preserved for generating the values of the next time step. The generated values used for operation and decision-making purposes are shown in Sheet 2 of the MS Excel file "Generated Values". The time series plot for the generated values is shown below.

Figure 12: Time series plot of the 50 years of generated values

Figure 13: Auto-correlogram of the 50 years of generated values

Figure 14: Partial auto-correlogram of the 50 years of generated values

Figure 15: Line spectrum of the generated data, showing a periodicity at approximately ω = 0.5236, similar to the original data

Figure 16: Power spectrum of the generated data, showing periodicity nearly corresponding to the original data

STANDARDIZED SERIES:

The original time series may be standardized in two ways: (a) using the long-term mean and standard deviation, or (b) using the monthly means and standard deviations. Standardization means subtracting the mean from the original data and dividing by the standard deviation:

$X_{st} = (X_i - \mu)/\sigma$

where X_st is the standardized value, X_i the original value, μ the mean of the original series, and σ the unbiased standard deviation of the original series. The advantage of standardizing is that the periodicities are removed from the resulting series. In this case, standardization by the second technique, i.e., using the monthly mean and standard deviation, yields a series devoid of periodicity.
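Technique (b), standardization by each calendar month's own mean and standard deviation, can be sketched as below (toy data with an annual cycle, not the actual rainfall series):

```python
import numpy as np

def standardize_monthly(x):
    """Standardize a monthly series by each calendar month's own mean
    and unbiased standard deviation -- technique (b) in the text."""
    x = np.asarray(x, dtype=float)
    z = np.empty_like(x)
    for m in range(12):
        vals = x[m::12]  # all Januaries, all Februaries, ...
        z[m::12] = (vals - vals.mean()) / vals.std(ddof=1)
    return z

t = np.arange(120)  # 10 years of monthly data
x = (100 + 60 * np.sin(2 * np.pi * t / 12)
     + np.random.default_rng(0).normal(0, 10, 120))
z = standardize_monthly(x)
```

Since the seasonal mean is removed month by month, the annual cycle (and hence the spectral peak at ω = 0.5236) disappears from the standardized series, which is exactly the behaviour seen in Figures 20 and 21.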
Figure 17: Comparative time series plot of the original and standardized data

Figure 18: Auto-correlogram of the standardized data

Figure 19: Partial auto-correlogram of the standardized data

Figure 20: Line spectra of the original and standardized series; the periodicities are seen to have been removed

Figure 21: Power spectra of the original and standardized series; the periodicities are seen to have been removed

ARMA MODELS (Auto-Regressive Moving Average Models):

ARMA models are used both for one-time-step-ahead forecasting, i.e., operation problems, and for long-term generation, i.e., planning problems. The steps to be followed, in order, are model identification, parameter estimation and validation. Here, model selection is based on the maximum likelihood rule (Kashyap and Rao, 1976):

$L_i = -\frac{N}{2}\ln\sigma_i^2 - \eta_i$

where L_i is the likelihood of the i-th model, N the total number of data points, σ_i the standard deviation of the residual series of the i-th model, and η_i the total number of parameters (the sum of the numbers of AR and MA parameters). The maximum likelihood rule is in agreement with the principle of parsimony of Box and Jenkins (1970).

The auto-correlogram shows a sinusoidal decay and the power spectrum shows the dominance of the middle frequencies. This suggests, as an initial guess, that AR models may be suitable for data generation. In hydrology we typically go up to AR(6) models for data generation and forecasting. The AR models may be represented by the following general equation, where ε_t is the residual series and p the order of the AR model:

$X_t = \sum_{i=1}^{p} \Phi_i X_{t-i} + \epsilon_t$

Using the "armax" function in MATLAB, we obtain the parameters of the AR models, tabulated below.

Table 5: Values of the AR parameters Φ

Of the 6 candidate models, it is seen that the ARMA(1,0) model yields the best results.
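The project estimates the AR parameters with MATLAB's "armax"; a rough stand-in is the Yule-Walker (lag-1 autocorrelation) estimate of the AR(1) coefficient, sketched below on a simulated AR(1) series with a known coefficient:

```python
import numpy as np

def fit_ar1(x):
    """Estimate phi for X_t = phi * X_{t-1} + eps_t via the lag-1
    Yule-Walker equation (a stand-in sketch, not MATLAB's armax)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    phi = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    resid = x[1:] - phi * x[:-1]   # residual series epsilon_t
    return phi, resid

# Simulate an AR(1) process with known phi = 0.6 to check the estimator
rng = np.random.default_rng(1)
phi_true = 0.6
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

phi_hat, resid = fit_ar1(x)
```

As the report notes for model validation, the residual series should behave like white noise: near-zero mean, no significant autocorrelation, and no periodicity.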
The generated values have a mean of 92.4534 mm and a standard deviation of 103.3727 mm, which are close to those of the original series; the skewness coefficient is 1.0827 and the kurtosis coefficient is 3.092. So, of the 6 candidate models, the AR(1) model can be used for the most accurate data generation. The residual series (when applicable) should ideally exhibit the properties of white noise: a mean of 0, no correlation, and no periodicity.

SIMPLE LINEAR REGRESSION:

Using this technique we fit a linear equation between the dependent and independent variables with the help of the available data. The correlation between variables X and Y is given by:

$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,S_x S_y}$

where x_i, y_i are the data points of the two series to be regressed; x̄, ȳ are the means of the respective series; and S_x, S_y are the standard deviations of the respective series.

Table 6: Values of average monthly precipitation, average monthly temperature and average monthly cloud cover

The correlation coefficients are:

Table 7: Correlation coefficients between the given variables
                            Precipitation   Avg. monthly temperature   Avg. monthly cloud cover
Precipitation                  1               0.275703521                0.983832629
Avg. monthly temperature       0.275703521     1                          0.38316949
Avg. monthly cloud cover       0.983832629     0.38316949                 1

From the above table it is clear that precipitation and average monthly cloud cover are very strongly correlated, while precipitation and average monthly temperature are not. So, using simple linear regression, we fit a relation between precipitation and average monthly cloud cover.
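A correlation matrix like Table 7 can be computed directly with NumPy. The three series below are toy stand-ins (not the actual Bardhaman monthly averages), constructed so that precipitation tracks cloud cover closely while temperature is unrelated:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy stand-ins for the three monthly-average series
cloud = rng.uniform(20, 90, 12)                 # avg. monthly cloud cover
precip = 3.0 * cloud + rng.normal(0, 5, 12)     # precipitation, tied to cloud cover
temp = rng.uniform(18, 32, 12)                  # avg. monthly temperature, unrelated

# Pearson correlation matrix, rows/cols: precip, temp, cloud (as in Table 7)
R = np.corrcoef(np.vstack([precip, temp, cloud]))
```

The strong precipitation/cloud-cover entry of the matrix is what justifies regressing precipitation on cloud cover alone in the next section.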
The equation of a straight line is y = a + bx, and the predicted value of y is ŷ_i = a + b·x_i. Using the least-squares method, the coefficients a and b are:

$b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = 0.177376601$

$a = \bar{y} - b\,\bar{x} = 25.39079848$

where x_i are the observations of x (precipitation), y_i the observations of y (average monthly cloud cover), x̄ the mean precipitation, and ȳ the mean cloud cover. So, the regression equation between precipitation (x) and average monthly cloud cover (y) is:

y = 0.177376601 x + 25.39079848

As average monthly cloud cover and average monthly temperature are themselves correlated, we cannot reliably use multiple linear regression to relate precipitation to these two variables together.

-------------------X---------------------
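The least-squares formulas above can be checked with a short sketch. The data points here are an exact synthetic line with coefficients close to those reported (an illustration only, not the project's temperature/cloud-cover table):

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares fit y = a + b*x, using the same normal-equation
    formulas as in the text."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b = (np.sum((x - x.mean()) * (y - y.mean()))
         / np.sum((x - x.mean()) ** 2))
    a = y.mean() - b * x.mean()
    return a, b

# Points lying exactly on y = 25.39 + 0.1774 x; the fit should recover
# the coefficients exactly (up to floating-point error)
x = np.array([10.0, 50.0, 120.0, 300.0, 450.0])
y = 25.39 + 0.1774 * x
a, b = simple_linear_regression(x, y)
```

Note that the denominator of b is the sum of squared deviations Σ(x_i − x̄)²; omitting the square (as a typo might suggest) would give a dimensionally inconsistent slope.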