Improvement of Substation Monitoring aimed to improve its efficiency with the help of Big Data Analysis

Data analysis has become one of the most widespread fields of research and has extended into almost every field of study. Given recent trends and developments in communication and information technology, there is scope for combining the monitoring of substation equipment with big data analysis technology. This combination improves data analysis ability, information sharing and the utilization rate of monitoring data. In the proposed work, the authors introduce big data analysis and its application to the monitoring of substations. Basic concepts and the procedures of typical data analysis for general problems are also discussed. As the main part of the paper, several distributed data analysis techniques are proposed, of which two relational online analysis tools, namely Hive and Impala, and one multidimensional online analysis tool, HBase, are the most important. These techniques are proposed with regard to analysis efficiency and storage performance, from the point of view of the business development requirements of the substation. The results show that the proposed model has an advantage in storage overhead and roll-up performance when compared with the traditional method, although data loading takes approximately 1.7-1.9 times as long as with the traditional model. Experiments are carried out to verify the validity of the model.


Introduction
Today, life without electricity is almost unimaginable; as individuals and as a society we are heavily dependent on it, and with advances in technology that dependency keeps increasing. Figure 1 [1] shows how electricity consumption has increased over the decades, especially in the Asia-Pacific region. This dependency requires a robust electricity generation and distribution system with a minimal failure rate. To ensure this and to avoid power outages, transmission lines and electrical substation (hereafter, substation) equipment must be monitored periodically, and precautionary measures should be taken to avoid faults [2,3]. The role of the substation is particularly vital, and making it smart and technology driven is a need of today's power system. Its proper operation can be ensured through periodic monitoring and inspection. This activity, generally referred to as substation patrol, is broadly classified into three categories: manual inspection, intelligent inspection and remote inspection. Manual inspection is characterized by high work intensity and low labor efficiency, and its outcome depends heavily on the inspector's analytical and comprehensive abilities. Intelligent inspection has the advantages of highly repeatable inspection work, high efficiency, a high degree of intelligent protocol operation and complete portable testing equipment, but it does not display good stability and reliability. The third and most recent method is remote inspection [4,5], which is mainly based on two monitoring systems: the power transmission condition monitoring system and the high-definition video remote real-time monitoring system. Remote inspection has real practical value for unmanned substations [6].
Regardless of the inspection method chosen, a meaningful assessment can be made from the collected reports and data only if these data are analyzed rationally, effectively and efficiently [6]. State monitoring of substation equipment generates redundant and diverse data, which must be brought into a uniform standard. Hence, the proposed model uses a combination of MySQL and the Hadoop Distributed File System (HDFS) as its big data analysis platform.

Motivation
Managing and effectively utilizing these data is becoming a major challenge, as the amount of data is growing enormously with digitization and the upgrading of technology. This becomes evident if we consider the units used to measure data on the internet, which have grown to exabytes ($10^{18}$ bytes) and zettabytes ($10^{21}$ bytes). A byte is a data unit comprising 8 bits and is equal to a single character in one of the words you are reading now. An exabyte is 1 billion gigabytes [7]. Not only the size but also the structure of the data is becoming more complicated. This creates a new challenge for electric power enterprises in analyzing and processing these data. At present, various kinds of diagnostic and state monitoring devices exist, with numerous interfaces. The data collected from these devices are stored in large data centers, from which they are accessed for various purposes, including the monitoring of the substation [7]. It is not always easy to utilize these data effectively and to take important and meaningful decisions related to substation monitoring. As substation data keep growing because of the continuous increase in load and consumption, traditional methods for analyzing these data are failing to cope. It has therefore become crucial to develop other techniques that can make effective use of these data and contribute to the monitoring of the substation [8].

Specific Contribution
The rapid developments in information technology offer an appropriate solution to this problem. These advancements can be applied to sense and control the traditional monitoring system and make it smart. The proposed work addresses state monitoring of substation equipment and is based on connectionless (join-free) hierarchical coding. The motivation for using hierarchical coding is its ability to compress the hierarchical data of the dimension tables and fold it into the fact table. This effectively eliminates a large number of connection (join) operations and yields optimized performance. The model makes efficient use of tools such as Hive, Impala and HBase according to their specific advantages. It improves the data monitoring of substations and shows a significant improvement over the traditional model. Our model is based on Not-Join Level-Encoding (NJLS), which is better suited to dealing with the huge amount of data generated during the inspection of substations.

Organization of the paper
The rest of the paper is organized as follows. The remainder of this section gives a brief review of the related literature and the contributions of various authors. Section 2 gives the theoretical background necessary for understanding the proposed work. Section 3 discusses the proposed model in detail. Section 4 presents the experiments carried out to validate the proposed model. Section 5 concludes the paper and summarizes the results obtained.

Literature survey
Of the research papers consulted in developing this article, a few important ones are summarized here. In general, these articles focus on data platform architecture, data cleaning and data storage, and are based on a status quo analysis of domestic research. Peng L proposed a Hadoop cloud computing based model for index queries and for storing substation data, designed in accordance with the characteristics of the collected data [9]. Wang H proposed a method based on unsupervised learning and time series analysis for irregularity detection, motivated by the limitations of existing models in handling the unpredictable and continually varying data of substation equipment [10]. To the knowledge of the authors, Lu S was the first to use big data and cloud computing for the monitoring of substations. Building on the existing infrastructure and models, Lu developed a data center and provided a better cloud computing based solution for the state monitoring of substations; this model was also helpful for the analysis of smart grids [11]. Xie X J showed that the existing system is limited in its application and will not be able to handle the huge amount of data generated by modern substations and smart grid networks. Xie used OLAP and other data mining tools to develop a better relation between the mined data and the conclusions drawn from them. Various efforts have been made across the globe to provide literature on state monitoring and on the development of intelligent subsystems, but there remains vast scope for the development of a practical model [12]. Peng X designed and developed a five-stage DWP model for the monitoring of substations. The model was helpful not only for the operation of data centers but also in the development process. Its only limitation was the need for an expert team to evaluate the obtained results, which are then used to improve the functions and the analysis of data [13]. A secure communication network was established in [30]. Moreover, an e-healthcare framework for quickest data transmission using a cyber-physical system was established by Sharma et al. in [31]. In [32], the authors solved the constraint quickest path problem for data transmission services in capacitated networks.

Background knowledge

A brief about Big Data
Although there is no single precise definition of big data, most of the proposed descriptions are related to each other. Big data is an emerging technical problem posed by datasets of various categories, complicated structure and large volume, which require innovative frameworks and techniques to extract and represent meaningful information [14]. According to Zikopoulos & Eaton [15], the definition of big data depends on the ability of the employed data mining algorithms and the corresponding hardware to deal with large-volume datasets; it is a relative concept rather than an absolute definition. In the words of Kaisler et al. [16], big data can be understood as an amount of data beyond technology's capability to store, manage and process efficiently, as data size increases along with the evolution of ICT [17].

Key elements of electrical substations
In this subsection, we briefly discuss the important elements of a typical substation [18]. An image depicting these elements is presented in Figure 2 [19].

Fence
The fence of a substation is important because it defines the periphery of the substation. It differs from ordinary fences in several respects: it is a set of long, semi-transmissive planar surfaces separated by vertical poles. The top of the fence is covered with a barbed wire structure for safety and security reasons. The surfaces of such fences can be penetrated by a laser, so that measurements of the vicinity of the substation can be obtained [20].

Cables
Transmission cables are the backbone of electric power transmission. They are curvilinear in shape, with a small thickness relative to their arc length. They are used to transmit electricity from power generation plants to our homes via substations [21].

Circuit Breakers, Bushings and insulators
The function of a circuit breaker is to protect the electrical system from excess current that may be caused by a short circuit or an overload in the system. Circuit breakers are generally rectangular in shape and are mainly of two types, as shown in Figure 2. They operate automatically and are an essential part of the system. Above the circuit breakers an insulating component, known as a bushing, is placed [22].

Bus Pipes
Bus pipes are cylindrical steel structures used in electric power distribution. They are generally placed inside panel boards, busway enclosures and switchgear for high-current power transmission over short distances. They are supported in air by insulated pillars and are generally uninsulated. This allows as many connections as required to be made without creating a joint in the bus wire [23].

Sources and characterization of big data
The monitoring data of substation equipment are collected by the data acquisition layer through various sensors and the state access controller; the collected data are then transmitted to the state access gateway in a web service format. This data set covers electricity from generation to consumption. It can be divided into structured data, such as fault data collected during abnormal behavior of a device and discrete information such as data related to device laser, etc.; semi-structured data, such as web service data; and unstructured data, such as status information images and related videos, as shown in Figure 3. Some of the collected data are difficult to assign to a single category and are therefore represented at the intersection of the categories, for example data related to the regional economy, load monitoring, power quality and surveillance video. Keeping data of like structure together makes the analysis easy and fast, since any analysis tool can then be applied to the same set of data as a whole. An example of the type of data collected from a substation is shown in Figure 4. Figure 4(a) shows the fluctuation in the readings of a smart meter measuring energy consumption; these data were taken from the Pecan Street database [24]. Figure 4(b) shows fluctuations in the active and reactive power of a substation. From these graphs it is evident that the smart energy meter does not generate a huge amount of data, whereas the amount of data generated by power measurement is huge [25].

Design of big data analysis platform for state monitoring of substation equipment
Since the data collected from the monitoring of substations are huge, a traditional relational database [26] such as MySQL has to be replaced by a non-relational database. Because the data source types are complex, it is necessary to use Sqoop, an open source tool, to conduct ETL (extract, transform and load) on the required data. After aggregation and association, these complex and huge data are to be stored in a unified structure. Once all the related operations on the data, such as statistical analysis, calculation and query, are completed, the result is ... In this diverse data set there are many wrong, complex and redundant data, which need to be extracted quickly in order to develop a uniform standard [27]. The most important part of the proposed platform is the combination of MySQL and HDFS (Hadoop Distributed File System). This integration is carried out by the data storage layer in such a way that the advantages of both systems can be utilized. After all these processes the data sets are stored in HDFS, whereas the role of MySQL is to store the model information for state monitoring of substation equipment and to manage the Hive metadata. MySQL also stores the fields and delimiters of all tables created by Hive. During data operations, the MySQL engine is used to verify the existence of metadata, whereas Impala shares metadata with Hive. Figure 5 shows the complete structure.
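The Sqoop-based extraction step described above can be sketched as follows. This is a minimal illustration only: the JDBC URL, user name, password file, table name and HDFS directory are hypothetical placeholders, not values from the paper, and the function merely assembles a `sqoop import` argument list without executing it.

```python
from typing import List


def build_sqoop_import(jdbc_url: str, username: str, password_file: str,
                       table: str, target_dir: str, mappers: int = 4) -> List[str]:
    """Assemble the argument list for a `sqoop import` run (not executed here)."""
    if mappers < 1:
        raise ValueError("mappers must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,            # JDBC connection string of the source
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                 # source table to import
        "--target-dir", target_dir,       # HDFS directory that receives the data
        "--num-mappers", str(mappers),    # number of parallel map tasks
    ]


# Hypothetical host, database, table and paths, for illustration only.
cmd = build_sqoop_import(
    jdbc_url="jdbc:mysql://db.example.org/monitoring",
    username="etl_user",
    password_file="/user/etl/.mysql.pw",
    table="breaker_status",
    target_dir="/warehouse/monitoring/breaker_status",
)
print(" ".join(cmd))
```

Building the command as a list keeps each option and its value separate, so it could be handed to a process launcher without shell quoting issues if one chose to run it.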

The proposed model
The NJLS (not-join level-encoding) model conditions the dimension information of the monitoring data according to the codes of the dimension hierarchies. It compresses this information and stores it in the fact table, so that condition monitoring can be carried out on the fact table alone: operations such as predicate evaluation can be executed independently, which reduces query overhead and tedious connection (join) operations. This makes the model more suitable for large-scale monitoring and for big data analysis performed on distributed clusters. The dimension level encoding uses decimal and binary encoding methods.
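As an illustration of the idea of folding a dimension hierarchy into the fact table, the following Python sketch packs one hierarchy path into a single binary code. It is not the authors' implementation: the three levels (substation, device, monitoring unit), their bit widths, and the `temperature_c` measure are assumptions chosen for the example.

```python
from typing import Sequence, Tuple


def encode_path(path: Sequence[int], widths: Sequence[int]) -> int:
    """Concatenate per-level member indices into one fixed-width binary code."""
    if len(path) != len(widths):
        raise ValueError("path and widths must have the same number of levels")
    code = 0
    for index, width in zip(path, widths):
        if not 0 <= index < (1 << width):
            raise ValueError(f"index {index} does not fit in {width} bits")
        code = (code << width) | index
    return code


def decode_path(code: int, widths: Sequence[int]) -> Tuple[int, ...]:
    """Recover the per-level indices from a code produced by encode_path."""
    parts = []
    for width in reversed(widths):
        parts.append(code & ((1 << width) - 1))
        code >>= width
    return tuple(reversed(parts))


# Assumed level widths in bits: substation, device, monitoring unit.
WIDTHS = (2, 3, 2)
path = (2, 4, 1)  # hypothetical member index at each level
code = encode_path(path, WIDTHS)

# The code is stored next to the measure, so no dimension table is needed.
fact_row = {"dim_code": code, "temperature_c": 61.5}

# A predicate on the top level is evaluated on the code alone, without a join.
substation = fact_row["dim_code"] >> (WIDTHS[1] + WIDTHS[2])
print(format(code, "07b"), decode_path(code, WIDTHS), substation)
```

Because every level occupies a fixed number of bits, a filter or grouping on any ancestor level reduces to a shift on the stored code.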

Model performance analysis
One important performance parameter of a data model is its spatial complexity, defined as the space required by the model. In designing the model, certain assumptions were made: the dimension of the model is assumed to be $d \in [1, \alpha]$ with a dimension level $l_h$, which is variable and chosen according to the type of data selected [29]. In general, $\lceil \log_2 n \rceil$ binary digits are needed to represent $n$ child nodes that belong to the same subset of a parent node in one dimension. The code of each dimension hierarchy can be expressed as a fixed-length binary string of the form $00 \cdots 00$. A certain measure of hierarchical coding is executed in the query according to a fixed rule. Dimension hierarchical encoding is used to obtain the status of the monitoring data, and the resulting space complexity is $O(h \lceil \log_2 n_{l_i} \rceil)$.
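The bit count behind this bound can be checked with a short Python sketch. The fan-outs below are hypothetical, and the one-bit minimum for a level with a single child is an assumption made here so that every level still receives a code.

```python
from typing import Sequence


def level_bits(n_children: int) -> int:
    """Bits for n sibling nodes: ceil(log2(n)) in integer arithmetic, minimum 1."""
    if n_children < 1:
        raise ValueError("a level needs at least one child node")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    return max(1, (n_children - 1).bit_length())


def code_length(fanouts: Sequence[int]) -> int:
    """Total bits for one dimension: the sum of the per-level widths."""
    return sum(level_bits(n) for n in fanouts)


# Hypothetical number of child nodes per hierarchy level of one dimension.
fanouts = [3, 5, 4]
per_level = [level_bits(n) for n in fanouts]
total = code_length(fanouts)
print(per_level, total)  # [2, 3, 2] 7
```

With $h$ levels, the total is bounded by $h$ times the widest level, which matches the form of the complexity expression given in the text.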

Experimental Results
To verify and validate the proposed model, we compared it with the existing model. The comparison combined Hive and Impala with the star model and with the NJLS model. The setup was run in a computer lab on a Hadoop distributed cluster of 10 PCs of similar configuration, with 1 PC configured as the master node and 9 as slave nodes. Each PC has an Intel Core i7-9700 processor with a speed of 3.7 GHz, 16 GB of RAM and a 512 GB boot drive. All were installed with a CentOS virtual machine, and the environments for Hive and Impala were set up.

Monitoring data preparation
For the numerical dimensions in the state monitoring data, three monitoring big data sets (S1, S2, S3) are used in this section to compare the performance advantages of the model in distributed ROLAP. We carried out three sets of experiments to observe the performance of the proposed model thoroughly. In the first experiment we compared the data loading time and speed, in the second the roll-up performance, and in the third the storage overhead.

Monitoring data loading
Among the experiments performed, one compared the performance of data loading, with Impala and Hive used as the loading tools. During this process data inconsistency may occur, because the data are inherently interrelated; however, this problem arises only if multiple threads are used for data loading. Figure 6 presents the loading time of the monitoring data and Figure 7 shows the speed of data loading. The following observations can be made:
(1) The proposed NJLS model is not advantageous in terms of data loading when compared with the traditional star model, because of the time consumed by the hierarchical coding and preprocessing that the NJLS model requires. The experiment shows that the loading speed of the proposed model is 42% lower than that of the traditional model.
(2) The Hive and Impala data loading rates were steady and did not decrease significantly as the amount of data increased. This is because, during loading, the data are stored in HDFS and the metadata are modified accordingly.
(3) The speed of loading monitoring data with Impala was somewhat lower than with Hive. Because of the particular file formats that can be used with Impala, the data need to be loaded using Hive first.
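A reduction in loading speed can be converted into a loading-time ratio with simple arithmetic. The sketch below assumes that "42%" refers to throughput and that both models load the same volume of data; under these assumptions the result is of the same order as the factor quoted in the abstract.

```python
def time_ratio_from_slowdown(slowdown: float) -> float:
    """Loading-time ratio implied by a fractional drop in loading speed.

    time = volume / speed, so a speed of (1 - slowdown) times the baseline
    gives a time of 1 / (1 - slowdown) times the baseline.
    """
    if not 0.0 <= slowdown < 1.0:
        raise ValueError("slowdown must be in [0, 1)")
    return 1.0 / (1.0 - slowdown)


ratio = time_ratio_from_slowdown(0.42)
print(f"{ratio:.2f}")  # 1.72
```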

Analysis operation
In OLAP, rotation (pivoting), dicing, slicing and drilling are the main analytical operations. A related term that helps in understanding the proposed model is roll-up, which aggregates data and lifts low-level data to a higher level along any dimension. Assume $i, j \in (1, n)$ with $i < j$, so that the levels $l_i$ and $l_j$ are totally ordered; a roll-up from $l_j$ to $l_i$ then aggregates the underlying data upwards. The aggregated data are an important element of distributed ROLAP operation. Conversely, drill-down is the reverse of the roll-up operation. In these experiments, the execution time of the roll-up operation is compared between the traditional star model and the NJLS model on the monitoring data sets S1, S2 and S3. In the roll-up process, the dimension values of the monitoring data collected from the circuit breaker and transformer monitoring units are rolled up from the device dimension level $l_2$ to the monitoring device level $l_1$. Figure 8 shows the roll-up operation time of the NJLS model and the star model for data sets of various sizes. Figure 9 shows the roll-up performance trend of the star model and the NJLS model in the Impala and Hive systems for the various state monitoring data sets. The following observations may be drawn from Figure 8 and Figure 9.
(1) The NJLS model performs better than the star model in the roll-up operations of both Impala and Hive: its roll-up execution time is 40% to 49% shorter than that of the star model. The reason is that the star model needs considerable time to perform join operations between the fact table and the dimension tables, so its response is slow and inefficient. Through hierarchical coding of all dimension information of the state monitoring data of substation equipment, a significant number of time-consuming operations are eliminated, and the connection (join) operations that burden the conventional star model no longer affect distributed ROLAP.
(2) The roll-up time of Impala on each state monitoring data set is shorter than that of Hive, because Hive executes queries on the underlying MapReduce engine, which is still a batch process. Impala, in contrast, adopts traditional MPP database technology and discards MapReduce, which to some extent enables real-time, interactive queries.
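The difference between the two roll-up strategies can be illustrated with a minimal Python sketch. It assumes a hypothetical two-level hierarchy in which the lowest two bits of each code identify the child and the remaining bits identify the parent; the dictionary lookup merely stands in for a fact-to-dimension join, and all identifiers and values are illustrative rather than taken from the experiments.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def rollup_star(facts: Iterable[Tuple[Hashable, float]],
                dim_table: Dict[Hashable, int]) -> Dict[int, float]:
    """Star-style roll-up: look up each child's parent, then aggregate."""
    totals: Dict[int, float] = defaultdict(float)
    for child_id, value in facts:
        parent_id = dim_table[child_id]  # stands in for the join
        totals[parent_id] += value
    return dict(totals)


def rollup_encoded(facts: Iterable[Tuple[int, float]],
                   child_bits: int) -> Dict[int, float]:
    """Encoded roll-up: drop the child bits of the code, then aggregate."""
    totals: Dict[int, float] = defaultdict(float)
    for code, value in facts:
        totals[code >> child_bits] += value  # parent code, no lookup needed
    return dict(totals)


# Hypothetical data: the same three readings in both representations.
star_facts = [("m0", 1.0), ("m1", 2.0), ("m4", 4.0)]
dim_table = {"m0": 1, "m1": 1, "m4": 2}
encoded_facts = [(0b0100, 1.0), (0b0101, 2.0), (0b1000, 4.0)]

star_result = rollup_star(star_facts, dim_table)
encoded_result = rollup_encoded(encoded_facts, child_bits=2)
print(star_result, encoded_result)
```

Both functions return the same totals per parent; the encoded variant obtains the parent key from the fact row itself, which is the property that removes the join from the roll-up.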

Storage overhead
Taking the data set of Figure 10 as an example, the following observations may be made:
(1) Compared with the conventional star model, the storage overhead of the NJLS model is reduced by 34% to 40%. NJLS [34] does not store every attribute of the dimension tables of the state monitoring data in a single fact table; instead, it stores the compressed dimension information in the fact table in the form of hierarchical codes, which reduces the space overhead effectively.
(2) Among the various big data analysis systems, Hive and Impala have a low storage overhead, and their storage structure is relatively simple because they directly use the native file format in HDFS to store data without introducing additional data. Hive and Impala big data analysis systems were used to build

Analysis of experimental results
This section summarizes the important results.
(1) The proposed model has no advantage in data loading when compared with the traditional models, because of the additional operations that it requires during loading. (2) In roll-up performance, the proposed model has an advantage over the traditional models.
Moreover, this advantage is independent of the choice of analysis tool and of the size of the data set. (3) Of the two tools used with the proposed model, Impala and Hive, Impala achieved the better roll-up performance for both models (NJLS and star). In addition, Impala delivers stable operating performance for monitoring data of various scales. In terms of storage overhead and data loading, Impala and Hive perform comparably [35].

Conclusion
The proposed model combines big data analysis with the state monitoring of substation equipment, which leads to several advantages over traditional models. With this model, the data collected by sensors and other monitoring devices can be utilized more effectively and in more ways, and data analysis and information sharing abilities are improved. Another advantage of the model is that it removes the connection (join) operations that must otherwise be performed between a large number of complicated tables in distributed ROLAP, which makes it more convenient to perform big data analysis on distributed clusters and large-scale data. The proposed model is thus a novel application of big data to the state monitoring of substations that enables the monitoring of substation equipment and allows proper and effective use of the collected data. Data storage, data mining, data platform architecture design and other related data processes are improved with this model. These applications make it apparent that big data deserves further study and extension to other fields. The proposed model has successfully combined the characteristics of big data analysis with the state monitoring of substations, achieving increased efficiency at the cost of a lower data loading speed.