Orthonormal Canonical Correlation Analysis


 Complex managerial problems are usually described by datasets with multiple variables, and in the absence of a theoretical model the data structures can be found by special multivariate statistical techniques. For two datasets, the canonical correlation analysis and its robust version are known as reliable working research tools. This paper presents their further development via the orthonormal approximation of the data matrices, which corresponds to using the singular value decomposition in the canonical correlations. The features of the new method are described and its applications are considered. This type of multivariate analysis is useful for solving various practical problems of applied statistics that require working with two data sets, and it can be helpful in managerial estimations and decision making.


Introduction
Multivariate statistical analysis presents a natural approach to the description and modeling of the big data available nowadays on various real managerial objects and problems. In the absence of an adequate theoretical model, which is a common situation in any practical research with complex social-economic data, a proxy structure can be identified by special multivariate statistical techniques. For two sets of variables with different designations, the Canonical Correlation Analysis (CCA) is the main tool for finding the parameters of their aggregation into total scores with maximum correlation (Dillon and Goldstein, 1984; Hardle and Hlavka, 2007; Izenman, 2008; Hardle and Simar, 2012). CCA serves to find how one group of predictor variables can influence another group of outcome variables.
The CCA technique was originated by H. Hotelling in 1936 for studying the correlations between the aggregated variables of two datasets, and it has been developed in multiple directions, including nonlinear analysis and several data sets (Horst, 1961; Fornell, 1982; Kessy et al., 2018). Its applications are known in different fields, from management and information systems to machine learning and biometrics; see examples in (Ahituv et al., 1998; Hair et al., 1998; Hardoon et al., 2004; Andrew et al., 2013; Adrover and Donato, 2015; Cao et al., 2015; Wilms and Croux, 2015; Jendoubi and Strimmer, 2019).
The Robust Canonical Analysis (RCA), a robust version of CCA which maximizes the covariance rather than the correlation of the aggregate scores, can also be described as the Singular Value Decomposition (SVD) of the matrix of cross-correlations between the variables of the two data sets. In contrast to the regular CCA, it does not contain matrix inversion and therefore suffers no impact from the matrices' ill-conditioning, so its results are not prone to multicollinearity among the variables within each set. The classical SVD method is widely applied in multivariate statistics; see, for example, (Horn and Johnson, 2013; Gentle, 2017; Mair, 2018; Demidenko, 2020; Irizarry, 2020). Various modifications of SVD have been introduced for specific solutions (Lipovetsky and Tishler, 1994; Lipovetsky and Conklin, 2005a; Lipovetsky, 2009, 2016). A special application of SVD to finding the predictors' relative importance in multiple regression was described in (Lipovetsky and Conklin, 2015). The approach taken in that work is based on the so-called orthonormal matrix approximation to a dataset, originated in the works (Gibson, 1962; Johnson, 1963, 1966). Applying this technique to the canonical correlation matrices yields the new technique of the ORthonormal Canonical Analysis (ORCA), which retains the good properties of both CCA and RCA. The new method and its properties are described and applications are considered. This type of modern multivariate analysis can serve in solving various practical problems and can be helpful in managerial estimations and decision making.
The paper is organized as follows. Regular CCA and RCA are described in Section II, the SVD and matrix orthonormal approximation are presented in Section III, and the ORCA method is introduced in Section IV. Relation to multiple regression is shown in Section V, numerical examples and comparisons are given in Section VI, and Section VII summarizes.

CCA and RCA techniques
Let us briefly describe the canonical correlation analysis, or CCA, and its modification, the robust canonical analysis, or RCA. Consider two data sets presented in the matrices X and Y of the orders N×n and N×m, where N is the number of observations, and n and m are the numbers of variables, or columns, in X and Y, respectively. Assume for convenience that the variables are centered and normalized by their standard deviations. The aggregate scores of each data set are

    ξ = X a,   η = Y b,                                  (1)

where ξ and η are vector-columns, both of the N-th order, and a and b are vector-columns of parameters of the orders n and m, respectively. The canonical correlation between these data sets, or the pair correlation of their scores ξ and η, measures the connection between the sets:

    λ = cov(ξ, η) / (var(ξ) var(η))^{1/2} = a′R_xy b / ((a′R_xx a)(b′R_yy b))^{1/2},   (2)

where cov and var denote the sample covariance and variance, the prime denotes matrix transposition, and the correlation matrices are denoted as follows:

    R_xx = X′X,   R_yy = Y′Y,   R_xy = X′Y = R′_yx.      (3)

CCA estimates the vectors a and b by maximizing the correlation (2) via the conditional Lagrange objective

    L = a′R_xy b − (λ/2)(a′R_xx a − 1) − (µ/2)(b′R_yy b − 1),   (4)

where λ and µ are the Lagrange terms. Differentiating L (4) with respect to the vectors a and b and setting these derivatives to zero yields the system of equations

    R_xy b = λ R_xx a,   R_yx a = µ R_yy b.              (5)

Multiplying the first and second equations (5) by the vector-rows a′ and b′, respectively, and using the normalizing conditions in (4) yields the equality of the Lagrange multipliers to the bilinear form

    λ = µ = a′R_xy b.                                    (6)

Solving the second equation (5) for the vector b and substituting it into the first one, and symmetrically solving the first equation (5) for the vector a and substituting it into the second one, produces the expressions

    R_xy R_yy^{-1} R_yx a = λ^2 R_xx a,   R_yx R_xx^{-1} R_xy b = λ^2 R_yy b.   (7)

Finally, the equations (7) can be rewritten in the regular CCA form

    R_xx^{-1} R_xy R_yy^{-1} R_yx a = λ^2 a,   R_yy^{-1} R_yx R_xx^{-1} R_xy b = λ^2 b.   (8)

The eigenproblems (8) present the classical CCA solution, with the eigenvalues λ^2 equal to the squared canonical correlations and the corresponding pairs of eigenvectors a and b.
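The eigenproblem (8) can be sketched numerically. The following is a minimal illustration on simulated standardized data (the data-generating choices and variable names are assumptions of the sketch, not the paper's examples):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 200, 4, 3
X = rng.standard_normal((N, n))
Y = X[:, :m] * 0.5 + rng.standard_normal((N, m))   # make the two sets correlated

# Center and normalize the columns, as assumed in the text
X = (X - X.mean(0)) / X.std(0)
Y = (Y - Y.mean(0)) / Y.std(0)

Rxx, Ryy, Rxy = X.T @ X / N, Y.T @ Y / N, X.T @ Y / N

# Eigenproblem (8): Rxx^{-1} Rxy Ryy^{-1} Ryx a = lambda^2 a
M = np.linalg.solve(Rxx, Rxy) @ np.linalg.solve(Ryy, Rxy.T)
lam2, A = np.linalg.eig(M)
k = np.argmax(lam2.real)
lam1, a = np.sqrt(lam2.real[k]), A[:, k].real

# The paired vector b follows from the second equation in (5), up to scale
b = np.linalg.solve(Ryy, Rxy.T @ a)

# The correlation of the aggregates xi = Xa, eta = Yb equals lam1 (up to sign)
xi, eta = X @ a, Y @ b
r = np.corrcoef(xi, eta)[0, 1]
assert np.isclose(abs(r), lam1)
```

The sketch confirms that the largest eigenvalue of the matrix in (8) is the squared correlation of the aggregate scores (1).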
In contrast to CCA, the RCA considers the covariance of the scores of the standardized data and estimates the normalized vectors a and b which maximize the numerator in (2), so the objective (4) becomes

    L = a′R_xy b − (λ/2)(a′a − 1) − (µ/2)(b′b − 1).      (9)

Then (5) reduces to the equations with identity matrices in place of the matrices R_xx and R_yy:

    R_xy b = λ a,   R_yx a = µ b.                        (10)

The relation (6) can be obtained from (10) as well, and (10) yields the eigenproblems

    R_xy R_yx a = λ^2 a,   R_yx R_xy b = λ^2 b.          (11)

The eigenproblem formulation (10)-(11) corresponds to the singular value decomposition of the matrix R_xy of intercorrelations between the variables of the two data sets, and the eigenvectors a and b give the solution for the maximum covariance between the aggregates (1). In contrast to CCA (8), the RCA solution (11) does not contain any matrix inversion and stays robust under multicollinearity among the variables in either set.
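Since the RCA solution (11) is just the SVD of R_xy, it is essentially one line in numpy. A minimal sketch on simulated data (the data generation here is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 300
X = rng.standard_normal((N, 5))
Y = X @ rng.standard_normal((5, 3)) + rng.standard_normal((N, 3))
X = (X - X.mean(0)) / X.std(0)
Y = (Y - Y.mean(0)) / Y.std(0)
Rxy = X.T @ Y / N

# RCA (11): SVD of the cross-correlation matrix; no inversion of Rxx or Ryy,
# so multicollinearity within either set cannot inflate the vectors
U, s, Vt = np.linalg.svd(Rxy)
a, b = U[:, 0], Vt[0]          # unit-norm eigenvectors of (11)

# s[0] is the maximum covariance of the unit-norm aggregates (1)
assert np.isclose(a @ Rxy @ b, s[0])
```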

SVD and matrix orthonormal approximation
Let us briefly consider the singular value decomposition using the example of the matrix X, noting that it can be repeated for the matrix Y as well. SVD can be described as a matrix approximation by the outer product of the vectors p and q of the N-th and n-th orders, respectively:

    X = ∆ p q′ + ε,                                      (12)

where ∆ is a constant term and ε denotes the matrix of residuals. In the least squares (LS) objective of minimizing the squared error,

    S^2 = || X − ∆ p q′ ||^2,                            (13)

the first-order conditions for the extremum, ∂S^2/∂p = 0 and ∂S^2/∂q = 0, together with the vectors' normalization p′p = 1, q′q = 1, give the system of equations

    X q = ∆ p,   X′ p = ∆ q.                             (14)

Substituting one of these equations into the other produces the eigenproblems

    X′X q = ∆^2 q,   X X′ p = ∆^2 p,                     (15)

where ∆^2 are the eigenvalues. They yield the matrix spectral decomposition

    X = Σ_{j=1}^{n} ∆_j p_j q′_j,                        (16)

where the matrix rank equals the number of predictors. Combining the eigenvectors p and q into the columns of the matrices P and Q, and all the terms ∆_j (the square roots of the eigenvalues (15), known as singular values) into the diagonal matrix

    D = diag(∆_1, ∆_2, . . . , ∆_n),                     (17)

we can represent the matrix X in the SVD form

    X = P D Q′.                                          (18)

The SVD formulation (18) presents the matrix X as the product of the orthonormal matrix P, the diagonal matrix D, and the orthogonal matrix Q, which have the following properties:

    P′P = I,   Q′Q = Q Q′ = I,                           (19)

where I is the identity matrix of the n-th order. By (18) and (19), the correlation matrix equals

    X′X = Q D^2 Q′,                                      (20)

where D^2 is the diagonal matrix of the squared singular values (17), that is, the matrix of the eigenvalues (15):

    D^2 = diag(∆^2_1, ∆^2_2, . . . , ∆^2_n).             (21)

Multiplying the equation (20) from the right-hand side by the matrix Q yields the relation

    (X′X) Q = Q D^2,

which corresponds to the first eigenproblem in (15) presented in matrix form. It is nothing else but the principal component analysis (PCA) eigenproblem for the correlation matrix X′X, with the eigenvalues in the matrix D^2 and the eigenvectors in the columns of the matrix Q. The second eigenproblem in (15) defines the PCA scores, due to the relation p = Xq/∆ following from (14).
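The decomposition (18) and its properties (19)-(20) can be checked directly. A small numpy sketch (the matrix here is random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((80, 5))
P, d, Qt = np.linalg.svd(X, full_matrices=False)   # X = P D Q'

I = np.eye(5)
assert np.allclose(P.T @ P, I)                            # P'P = I in (19)
assert np.allclose(Qt @ Qt.T, I)                          # QQ' = Q'Q = I in (19)
assert np.allclose(X, P @ np.diag(d) @ Qt)                # SVD form (18)
assert np.allclose(X.T @ X, Qt.T @ np.diag(d**2) @ Qt)    # spectral form (20)
assert np.allclose(X.T @ X @ Qt.T, Qt.T @ np.diag(d**2))  # PCA eigenproblem (X'X)Q = Q D^2
```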
A function of a square matrix is defined by applying it to the eigenvalues in the matrix decomposition by the eigenvectors, so the square root of the correlation matrix (20) and its inverse are defined as follows:

    (X′X)^{1/2} = Q D Q′,   (X′X)^{-1/2} = Q D^{-1} Q′.  (22)

Indeed, the product of two square root matrices, taking into account the properties (19) of the orthogonal matrix Q, is

    (Q D Q′)(Q D Q′) = Q D^2 Q′,                         (23)

which returns the same correlation matrix (20). Also, the product of the two matrices in (22) leads, as it should, to the identity matrix:

    (Q D Q′)(Q D^{-1} Q′) = Q Q′ = I.                    (24)

Now let us describe the so-called orthonormal approximation of the matrix X by its SVD vectors. For this aim, using (18)-(19) and (22), consider the following matrix form:

    X = Z (X′X)^{1/2},                                   (25)

where the matrix Z is defined as

    Z = X (X′X)^{-1/2} = P Q′.                           (26)

The matrix Z was introduced in (Gibson, 1962; Johnson, 1966) and used for regression modeling in (Lipovetsky and Conklin, 2015). By the relations (19), this matrix has the property of orthonormality:

    Z′Z = Q P′ P Q′ = I.                                 (27)

Due to (18) and (26), the projections of X by Z and of Z by X produce the same symmetric matrix:

    Z′X = X′Z = Q D Q′ = (X′X)^{1/2}.                    (28)

The matrix Z (26) differs from the matrix X (18) by replacing the diagonal matrix D of singular values (17) with the identity matrix I. Thus, the matrix Z does not depend on the singular values (17), so it is not prone to inversion problems and multicollinearity in the data. The matrix Z presents the best orthonormal approximation to the data matrix X, and, as shown in (Lipovetsky and Conklin, 2015), it is possible to improve this approximation to a better fit of X by Z. For this aim, consider a constant parameter t with the matrix Z, when the LS objective is

    S^2 = || X − t Z ||^2 = Tr(X′X) − 2t Tr(Z′X) + t^2 Tr(Z′Z) = n (1 − 2t ∆̄ + t^2),   (29)

where (27)-(28) are taken into account, the trace of the correlation matrix equals its order, Tr(X′X) = n, and ∆̄ denotes the mean value of the singular values (17):

    ∆̄ = (∆_1 + ∆_2 + · · · + ∆_n)/n.                     (30)

The convex function (29) reaches its minimum at t = ∆̄; thus, a higher-precision approximation of X by the orthonormal matrix is X ≈ ∆̄ P Q′. All the relations (12)-(30) can be rewritten for the matrix Y as well.
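The orthonormal approximation (25)-(30) can be sketched as follows (random data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))
X = (X - X.mean(0)) / np.sqrt(((X - X.mean(0)) ** 2).sum(0))  # unit-SRSS columns, Tr(X'X) = n

P, d, Qt = np.linalg.svd(X, full_matrices=False)
Z = P @ Qt                                    # orthonormal approximation (26)
assert np.allclose(Z.T @ Z, np.eye(4))        # orthonormality (27)

Rxx_half = Qt.T @ np.diag(d) @ Qt             # (X'X)^{1/2} as in (22)
assert np.allclose(X, Z @ Rxx_half)           # factorization (25)

# Best scalar fit (29)-(30): t equal to the mean singular value minimizes ||X - tZ||^2
t = d.mean()
s2 = lambda t_: np.linalg.norm(X - t_ * Z) ** 2
assert s2(t) < s2(t + 0.05) and s2(t) < s2(t - 0.05)
```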

ORCA method
Let us apply the considered orthonormal presentation technique to the CCA problem with two data matrices X and Y, for each of which there is the SVD decomposition (18), rewritten more specifically for each matrix:

    X = P_x D_x Q′_x,   Y = P_y D_y Q′_y.                (31)

The relation (25) can be represented for both matrices (31) as follows:

    X = Z_x R_xx^{1/2},   Y = Z_y R_yy^{1/2},            (32)

where the two matrices Z are defined similarly to (26). In regression modeling, the effect of multicollinearity occurs when the correlation matrix is close to a singular matrix, that is, some of its eigenvalues are close to zero. The inversion of such a matrix corresponds to division by values close to zero, which inflates the model coefficients to big values of both signs. Multicollinearity within either of the two data sets can impact the CCA estimations (8), deteriorating the solution.
Let us describe how to escape these undesirable effects with the help of the orthonormal canonical analysis, the ORCA approach. Consider the first of the CCA relations (8), which, taking into account the definitions (3), can be rewritten in the following structure:

    (X′X)^{-1} X′Y (Y′Y)^{-1} Y′X a = λ^2 a.             (33)

Using the constructs (32), we can represent the formula (33) via the orthonormal Z matrices:

    R_xx^{-1/2} Z′_x Z_y Z′_y Z_x R_xx^{1/2} a = λ^2 a.  (34)

A similar relation, up to the interchange of the x and y subscripts, can be derived from the second equation (8) for the vector b. Introducing two new vectors α and β defined by the transformations

    α = R_xx^{1/2} a,   β = R_yy^{1/2} b,                (35)

we can rewrite the expression (34) and its counterpart for the other vector as follows:

    Z′_x Z_y Z′_y Z_x α = λ^2 α,   Z′_y Z_x Z′_x Z_y β = λ^2 β.   (36)

These eigenproblems present the ORCA solution expressing the original CCA (8) via the orthonormal matrices (26) built for both original matrices as in (32):

    Z_x = P_x Q′_x,   Z_y = P_y Q′_y.                    (37)

Such matrices contain the SVD vectors but not the singular values, so they are neither ill-conditioned nor prone to collinearity in the data; thus, the ORCA problems (36) yield robust solutions for the eigenvectors. Similarly to the matrices R_xy = X′Y and R_yx = Y′X built in (3) from the original data X and Y, consider the cross-product matrices S_xy and S_yx built from the orthonormal matrices Z_x and Z_y:

    S_xy = Z′_x Z_y,   S_yx = Z′_y Z_x = S′_xy.          (38)

Then the eigenproblems (36) can be simplified as follows:

    S_xy S_yx α = λ^2 α,   S_yx S_xy β = λ^2 β,          (39)

which resembles the RCA technique described in (11). Thus, the ORCA solution presents the SVD (39) applied to the products of the orthonormal approximation matrices (38). With the eigenvectors from (39), the original vectors of the CCA (8) can be found from the relations (35) inverted:

    a = R_xx^{-1/2} α,   b = R_yy^{-1/2} β.              (40)

Each of the original CCA eigenproblems (8) contains two inverted matrices, R_xx^{-1} and R_yy^{-1}, which produces the impression that both vectors a and b are prone to multicollinearity in either original group of variables. However, the results (40) prove that the vectors a and b depend on possible ill-conditioning only in the one matrix of the X or Y data which each of them aggregates (1).
For example, if the matrix Dx contains singular values close to zero, but the matrix Dy does not have small values, then only the vector a is impacted by multicollinearity.
It is possible to substitute the relations (37) into (36) to represent the ORCA eigenproblems in a deeper data structure as follows:

    Q_x P′_x P_y Q′_y Q_y P′_y P_x Q′_x α = λ^2 α,   Q_y P′_y P_x Q′_x Q_x P′_x P_y Q′_y β = λ^2 β.   (41)

For the orthogonal matrices Q, with the properties (19) and Q^{-1} = Q′, we can transform (41) to

    P′_x P_y P′_y P_x (Q′_x α) = λ^2 (Q′_x α),   P′_y P_x P′_x P_y (Q′_y β) = λ^2 (Q′_y β).   (42)

Introducing new vectors by the transformations

    u = Q′_x α,   v = Q′_y β,                            (43)

we can reduce the relations (42) to the expressions

    P′_x P_y P′_y P_x u = λ^2 u,   P′_y P_x P′_x P_y v = λ^2 v.   (44)

These eigenproblems reveal the simple CCA structure given via the orthonormal P matrices of the SVD solutions (31). Similarly to the matrices S_xy and S_yx built from the orthonormal matrices Z_x and Z_y in (38), we can introduce the cross-product matrices T_xy and T_yx built from the orthonormal matrices in (44):

    T_xy = P′_x P_y,   T_yx = P′_y P_x = T′_xy.          (45)

Then the eigenproblems (44) reduce to the following forms:

    T_xy T_yx u = λ^2 u,   T_yx T_xy v = λ^2 v,          (46)

which corresponds to the RCA techniques (11) and (39). Therefore, the ORCA solution presents the combined SVD (46) applied to the products (45) of the separate SVD scores P_x and P_y (31). When the problems (46) are solved, the original vectors of the CCA (8) can be found, similarly to (40), from the relations (43) and (35) inverted:

    a = Q_x D_x^{-1} u,   b = Q_y D_y^{-1} v.            (47)

The ORCA eigenproblems (39) and (46) have the same canonical correlation values λ. Their eigenvectors depend on the normalization of the matrices used, but these vectors can be transformed by (40) or (47) back to the original CCA vectors a and b.
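A sketch of the full ORCA pipeline (31)-(40) on simulated data, verifying that the recovered vectors reproduce the canonical correlation (the data generation is an assumption for the sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 250
X = rng.standard_normal((N, 4))
Y = X[:, :3] + 0.8 * rng.standard_normal((N, 3))
X = (X - X.mean(0)) / X.std(0)
Y = (Y - Y.mean(0)) / Y.std(0)

# SVDs (31) and orthonormal matrices (37): Zx = Px Qx', Zy = Py Qy'
Px, dx, Qxt = np.linalg.svd(X, full_matrices=False)
Py, dy, Qyt = np.linalg.svd(Y, full_matrices=False)
Zx, Zy = Px @ Qxt, Py @ Qyt

# ORCA (38)-(39): SVD of Sxy = Zx'Zy; its singular values are the canonical correlations
Sxy = Zx.T @ Zy
U, lam, Vt = np.linalg.svd(Sxy)
alpha, beta = U[:, 0], Vt[0]

# Back-transformation (40): a = Rxx^{-1/2} alpha = Qx Dx^{-1} Qx' alpha, same for b
a = Qxt.T @ ((Qxt @ alpha) / dx)
b = Qyt.T @ ((Qyt @ beta) / dy)

# The aggregates' correlation reproduces the first canonical correlation lam[0]
xi, eta = X @ a, Y @ b
r = xi @ eta / np.sqrt((xi @ xi) * (eta @ eta))
assert np.isclose(abs(r), lam[0])
```

Note that no matrix is inverted in the eigenproblem itself; the singular values enter only the back-transformation (40).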

Relation to multiple regression
As discussed with the relations (40), and as can be seen by (47) too, the vectors a and b depend on ill-conditioning only in the corresponding matrix X or Y. The extent to which the vectors are impacted by multicollinearity depends on the inverted singular values in the matrices D_x^{-1} or D_y^{-1}, respectively. It means that the level of multicollinearity is rather measured by the square roots of the inverted matrices, R_xx^{-1/2} and R_yy^{-1/2}, and not by the inverted correlation matrices R_xx^{-1} and R_yy^{-1} themselves. This conclusion can be related to measuring the degree of multicollinearity in regression modeling as well.
Indeed, if one data set in (1) contains only one variable y, then the CCA problem reduces to the multiple regression model. With one variable y, the vector b degenerates to a scalar, and the matrix R_xy reduces to the vector r_xy of the correlations of the regressors x with the outcome. Then the first equation (5), with the constant terms absorbed into the vector of parameters a, reduces to the expression

    R_xx a = r_xy,                                       (48)

which is nothing else but the normal system of equations for the regression model of y by the x's.
It is interesting to note that the second relation (5) reduces at the same time to the scalar product of the vector r_xy and the vector of regression parameters a, which defines the coefficient of multiple determination, r′_xy a = R^2, measuring the quality of the model fit. The solution of the system (48) is given via the matrix inversion as follows:

    a = R_xx^{-1} r_xy.                                  (49)

The expression (49) produces the impression that the impact of multicollinearity can be measured by the inverted correlation matrix. The so-called variance inflation factor, defined by the diagonal elements of this matrix, is often used as a gauge of each j-th predictor's influence on the effect of multicollinearity (Lipovetsky and Conklin, 2005b):

    VIF_j = (R_xx^{-1})_jj.                              (50)

The elements of the inverted matrix in (49)-(50), as we can see by (20)-(22), are defined by the squared matrix of the inverted singular values, D^{-2} = diag(∆_1^{-2}, . . . , ∆_n^{-2}), that is, by the inverted eigenvalues of the correlation matrix. However, the solution (49) is actually defined via the non-squared inverted singular values. Indeed, using the SVD results from (18) and (20) in (49), we can rewrite it as follows:

    a = (Q D^2 Q′)^{-1} Q D P′ y = Q D^{-1} P′ y.        (51)

The expression (51) reveals that the solution (49) for the coefficients of multiple regression depends not on the inverted eigenvalues ∆_j^{-2} but on the inverted singular values ∆_j^{-1} in the matrix D^{-1}. It means that in measuring the impact of multicollinearity on the regression parameters, instead of the scale of the VIF factors (50), which depends on D^{-2}, it makes sense to use a scale related to D^{-1}, more adequate to (51). For example, in place of the indicator (50), it could be better to apply such a measure as its square root. In another evaluation, taking (22) into account, the adjusted VIF can rather be evaluated as follows:

    VIF_j (adjusted) = (R_xx^{-1/2})_jj = (Q D^{-1} Q′)_jj.   (52)

The expression (52) can also be used as a measure of multicollinearity for the CCA and ORCA problems.
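The pseudoinverse solution (51) and the two multicollinearity scales (50) and (52) can be compared in a short sketch (the near-collinear design here is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 150
X = rng.standard_normal((N, 3))
X = np.column_stack([X, X[:, 0] + 0.05 * rng.standard_normal(N)])  # near-collinear column
X = (X - X.mean(0)) / np.sqrt(((X - X.mean(0)) ** 2).sum(0))       # so X'X is the correlation matrix
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(N)

P, d, Qt = np.linalg.svd(X, full_matrices=False)
a_svd = Qt.T @ ((P.T @ y) / d)                 # solution (51): a = Q D^{-1} P' y
a_ls = np.linalg.lstsq(X, y, rcond=None)[0]    # the LS solution (49)
assert np.allclose(a_svd, a_ls)

vif = np.diag(Qt.T @ np.diag(d ** -2.0) @ Qt)      # classical VIF (50), scale D^{-2}
vif_adj = np.diag(Qt.T @ np.diag(1.0 / d) @ Qt)    # adjusted measure (52), scale D^{-1}
# Under near-collinearity the adjusted scale grows much more moderately
assert vif[0] > vif_adj[0]
```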
Another important note can be made on the expression (51), which actually presents the so-called pseudoinverse, or generalized matrix inverse, also known as the Moore-Penrose inverse. It presents the least squares solution (49) of the system of linear equations (48) expressed via the SVD matrices (18) or (31). This solution can be obtained directly from the linear model y = Xa written via the orthonormal matrix P, the diagonal matrix D, and the orthogonal matrix Q:

    y = P D Q′ a.

Multiplying this relation from the left-hand side subsequently by the operators P′, D^{-1}, and Q, and using the properties (19), yields the equation

    a = Q D^{-1} P′ y,

which coincides with the solution (51) for the linear regression parameters. Therefore, the regression model, as well as the CCA and its modifications, can be expressed in terms of the generalized matrix inversion.

Numerical examples
Consider a data set on a financial problem in the file LifeCycleSavings, used in the R statistical package as an example for canonical correlation analysis performed by the function cancor. The first matrix X contains two variables (x1 - pop15 and x2 - pop75, the percentages of population under 15 and over 75 years old, respectively), and the second matrix Y consists of three variables (y1 - sr, the savings ratio, presented by the aggregate personal saving divided by the disposable income; y2 - dpi, the real per-capita disposable income; y3 - ddpi, the percentage rate of change in per-capita disposable income). The data are given for 50 countries, with the characteristics averaged over a decade to remove the business cycle and other short-term fluctuations, as described in (Belsley et al., 1980). The outcome of the function cancor(X, Y) is presented in the three numerical columns at the left-hand side of Table 1 as the original CCA solution built by the covariance matrices. The same data centered and normalized by the square root of the sum of squares (SRSS, given in the central column of Table 1) correspond to using correlation matrices in the CCA problems (8), which yield the results shown in the three columns at the right-hand side of Table 1. The CCA solutions, similarly to the regression model parameters, are invariant to the scaling transformation, so the canonical correlations coincide for both solutions of (8); the standardized solution is more convenient for comparing the vector parameters and finding the more influential of them by their bigger absolute values. There are two variables in the first matrix X, so there are two CCA vectors a with two elements each, while for the second matrix Y with three variables there are three vectors b, each with three elements. Each pair of vectors a and b defines the aggregated scores (1) and the corresponding canonical correlation (2), shown at the bottom line below each column of vectors. The rank of the matrix used in the eigenproblem (8) equals 2 because it is the minimum dimension of the matrices entering that combination.
Therefore, there are two positive eigenvalues and one zero eigenvalue. The maximum correlation (the square root of the eigenvalue λ^2 in (8), where the squared canonical correlations are used) equals 0.8248, and the next one equals 0.3653. Below the vectors a and b in the standardized units, the norms of the vectors (the square roots of the sums of the vectors' squared elements) are given. Table 2 presents several solutions. The first solution, shown in two columns, is the CCA (8) given by the vectors divided by their norms, so each new norm equals one, as shown under each vector in this table. The third vector b is not useful for canonical correlation analysis because it is related to the zero correlation (as in Table 1), so it is omitted in Table 2. The canonical correlations shown in the last row of Table 2 are the same as in Table 1 because they do not depend on the vectors' normalization. The next two columns in Table 2 present the robust canonical correlations, RCA, obtained by the objective of the canonical covariance (11). Taking its eigenvectors for the aggregated scores (1) and calculating the correlations with them yields the values given under these vectors in the last row of Table 2. We see that these correlations are slightly smaller than those of the CCA solution, but the eigenvectors are more robust, as discussed for the solution (11). The last two columns in Table 2 show the ORCA solution (39). It combines the best features of CCA and RCA: on one hand, it has the same maximum possible correlations as CCA, and on the other hand, it is built on the orthonormal approximation of the data matrices, which does not involve matrix inversion and thus is robust to the detrimental effects of possible multicollinearity in the datasets.
It is useful to note that the solutions for each vector are very similar across all three discussed methods, CCA, RCA, and ORCA, in Table 2. In all solutions for the first pair of vectors, the predictors x1 and x2 enter the vector a with opposite signs and are connected mostly with the outcome y2 in the vector b. In the second pair of vectors, the predictors x1 and x2 enter the vector a with the same sign and are connected mostly with the outcome y1 of the opposite sign in the vector b. Note that although the eigenvectors are defined up to an arbitrary constant, if the sign of one vector is changed (for instance, of the vector a), then the sign should be changed in the paired vector (the vector b) as well.
Consider another example, from a real marketing research project on a specific pharmaceutical product. The opinions of 420 respondents were measured on the 5-point Likert scale for the following characteristics. The first data set X consists of the predictors: x1 - quick relief; x2 - safe to use; x3 - safe with a specific disease; x4 - few side effects; x5 - long-lasting relief; x6 - effective for children; x7 - provides the most symptom relief; x8 - provides the highest patient satisfaction; x9 - provides the highest quality of life. The second data set Y contains the outcome variables: y1 - low cost; y2 - provides the most value; y3 - patient satisfaction; y4 - likelihood of prescription. Table 3 presents the CCA results similarly to Table 1: the original solution for four vectors built by the covariance matrices, the column of SRSS, and the CCA by the correlation matrices. The norms of the vectors are given below each of the a and b vectors in the standardized units. The canonical correlations (2) of the predictors' aggregates (1) are given in the last line of Table 3, where the maximum correlation reaches 0.692 and the others are noticeably smaller. It is interesting to mention that the pair correlations between the original variables of the two datasets are mostly at the level of 0.1-0.2, and all of them are below the level of 0.5. Therefore, the first pair of the canonical variables presents the best combination of the variables of each dataset. Elements of the vectors have a meaning similar to the loadings in factor analysis, so they correspond to the most influential variables in the connection between the two data sets. The next Table 4 is organized similarly to Table 2. It presents three solutions, each shown in four columns of the vectors normalized to unit norm, as indicated under each vector: the CCA (8), the robust canonical correlations RCA (11), and the ORCA orthonormal results of (39). Again, all these solutions are very similar across the three discussed methods.
The canonical correlations given in the last row of Table 4 are the same for CCA and ORCA, and they equal those in Table 3. For the RCA, taking its eigenvectors for the aggregated scores (1) and calculating the correlations (2) with them yields slightly smaller correlations than those of CCA, but the RCA eigenvectors are more robust. For the sake of convenient comparison, the signs in several paired vectors a and b in RCA, and α and β in ORCA, were changed to adjust them to the basic CCA vectors. The first vectors' solution, with the maximum canonical correlation, suggests the main loadings for the variables' aggregation in the two datasets. As we see by the vector elements, the main role in the first group is played by the variables x8 and x9, and in the second group the main variables are y2 and y4. The described closeness of the CCA, RCA, and ORCA results was observed for other studied datasets as well.

Summary
The paper considered the canonical correlation analysis, or CCA technique, together with the earlier developed robust canonical analysis, RCA, and the newly introduced orthonormal canonical analysis, ORCA, which, like the other problems, can also be presented as a specific eigenproblem. ORCA combines the best features of CCA and RCA: on one hand, it has the same maximum possible correlations as CCA, and on the other hand, it is built on the orthonormal approximation of the data matrices, which does not involve matrix inversion and thus is robust to the detrimental effects of possible multicollinearity in the datasets. It is also shown that the CCA formulation, with two inverted matrices in the eigenproblem for each vector, gives the impression that both vectors can be prone to multicollinearity in either original group of variables. However, the results of the ORCA derivation prove that each vector depends on possible ill-conditioning only in the matrix of the data which it aggregates. Thus, the new solutions can be used for finding a better and more adequate interpretation of the variables' integration into their aggregates. The relations with multiple regression modeling and a description of the variance inflation factor and its adjustment to CCA problems are also given.
For future studies, the orthonormal canonical correlation approach can be extended to problems with several data sets. Also, more research is needed to improve the measures of multicollinearity in the CCA problems and their modifications. The numerical results demonstrate a close similarity of the considered techniques, so the new approach can be successfully applied to problems of aggregation in complex data. This type of modern multivariate analysis can serve in solving various practical problems and can be helpful in managerial estimations and decision making.