Abstract: Query processing is the means by which data is retrieved from a database dependably, and the performance of a database system depends on the query processing strategies it uses. A database must be able to answer its users' requests for data. Large database systems may run in unpredictable and unstable environments, where it becomes difficult to produce efficient query plans from the information available at compile time; returning the result in a timely manner is the task of query optimization. Efficient processing of queries is an essential requirement in many interactive environments that involve large amounts of data. This paper explains the effect of query processing and optimization on the distributed database (DDB), which requires the transmission of data between computers in a network. The arrangement of data transmissions and local data processing is known as a distribution strategy for a query. Two cost measures, response time and total time, are used to judge the quality of a distribution strategy. Moreover, different algorithms are presented that derive distribution strategies with minimal response time and minimal total time for a special class of queries, in order to determine the performance of the DDB.
Keywords: query processing, query optimization, distributed database.
In general, a database system should be able to reply to the requests of its users. Retrieving data from a database system is the task of query processing, and returning the result within an acceptable time is the task of query optimization. Query processing and query optimization are essential parts of an RDBMS: the result of a query should be returned to its user, which may be a person, a robotic assembly machine, or another DBMS, within the timeframe that the user requires [5]. Query processing governs the performance of the database, while query optimization governs the response time of the database system.
Furthermore, a database query is a request sent to the RDBMS to retrieve or modify specific data. Retrieval and updates are carried out through different low-level operations in the RDBMS, which can be relational algebra operations such as project, join, select, and Cartesian product [5].
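The relational algebra operations named above can be sketched over plain Python lists of dictionaries. This is only an illustrative sketch, not how an RDBMS implements them; the `employees` and `departments` tables and all function names are hypothetical.

```python
from itertools import product

def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Projection: keep only the given columns and drop duplicate rows."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out

def cartesian_product(left, right):
    """Cartesian product: pair every left row with every right row."""
    return [{**l, **r} for l, r in product(left, right)]

def join(left, right, column):
    """Equi-join on one shared column (a filtered Cartesian product)."""
    return [{**l, **r} for l, r in product(left, right) if l[column] == r[column]]

# Hypothetical sample data.
employees = [
    {"emp_id": 1, "dept_id": 10, "salary": 2500},
    {"emp_id": 2, "dept_id": 20, "salary": 4000},
]
departments = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "Research"},
]

low_paid = select(employees, lambda r: r["salary"] < 3000)
names = project(join(low_paid, departments, "dept_id"), ["dept_name"])
print(names)
```

Here the join is written as a filtered Cartesian product for clarity; real systems use the more efficient join algorithms discussed later in the paper.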
A Database Management System (DBMS) stores data in a way that makes it easier to retrieve, manipulate, and produce information. It enables users to create and manage databases, allows data to be accessed by multiple users in different locations, and lets users create, read, update, and delete data in the database. The DBMS can also control how an end user views data by giving users different permissions to access the data. A Relational Database Management System (RDBMS) is a specific type of DBMS that uses the relational model: it lets users store data in multiple tables that are related to one another through common fields. It is also the most popular type of database system; examples include MS SQL Server, DB2, Oracle, and MySQL. Users of a DBMS can be classified into three types.
Query processing and query optimization are the most important components of an RDBMS: “these components are responsible for translating a user query, usually written in a non-procedural language like SQL – into an efficient query evaluation program that can be executed against the database.” (Saurabh Gupta, Gopal Singh Tandel, Umashankar Pandey, 2015) [8].
Moreover, query processing and optimization play an important role in the distributed database (DDB) in terms of performance, which is measured by different algorithms. In a DDB, data is distributed over various sites and is accessed through query requests; query processing and optimization choose the best way to execute each query. In a distributed database, queries are affected by:
1. Insertion method of the data to the server.
2. Transport time among servers.
The response time of a query depends on the transmission time between servers [1].
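To make the two cost measures used in this paper concrete, the sketch below computes total time (the sum of all transmission costs) and response time (the time until the last transmission finishes, with independent transmissions running in parallel). The linear cost model c0 + c1 * size, its constants, and the two sample schedules are assumptions chosen for illustration only.

```python
def transmission_cost(size, c0=10.0, c1=1.0):
    """Assumed linear cost model: fixed setup cost plus a per-unit cost."""
    return c0 + c1 * size

def total_time(schedule):
    """Total time: the sum of the costs of all transmissions."""
    return sum(transmission_cost(t["size"]) for t in schedule.values())

def response_time(schedule):
    """Response time: finish time of the last transmission, where a
    transmission starts only after every transmission it waits for."""
    finish = {}

    def finish_time(name):
        if name not in finish:
            t = schedule[name]
            start = max((finish_time(p) for p in t["after"]), default=0.0)
            finish[name] = start + transmission_cost(t["size"])
        return finish[name]

    return max(finish_time(name) for name in schedule)

# Hypothetical strategy 1: both relations are shipped to the client in parallel.
parallel = {
    "R1_to_client": {"size": 100, "after": []},
    "R2_to_client": {"size": 300, "after": []},
}
# Hypothetical strategy 2: R1 is shipped to site 2 first, and the reduced R2
# is shipped to the client afterwards.
serial = {
    "R1_to_site2": {"size": 100, "after": []},
    "R2_reduced_to_client": {"size": 50, "after": ["R1_to_site2"]},
}

print(total_time(parallel), response_time(parallel))
print(total_time(serial), response_time(serial))
```

Under these assumed numbers, the serial strategy is better on both measures because the reduction shrinks the data so much; with other sizes, a parallel strategy can have the lower response time while having the higher total time.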
This paper explains the effect of query processing and optimization on the distributed database (DDB), in terms of response time and transmission cost, by describing several algorithms.
1.1 Distributed Database
A distributed database is a collection of databases stored at various sites of a computer network: “A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network” (Swati Gupta, Kuntal Saroha, Bhawna, 2011). A distributed database management system (DDBMS) is software that manages the DDB and makes the distribution transparent to the users. Each database may run a different DBMS and a different architecture, across which the execution of operations is distributed [10].
The DDB also plays an important role nowadays, when all sorts of users need to be connected to a company's databases: in addition to the company's own employees, customers, potential customers, and vendors need access to the information in the databases [9, 10].
The idea of the DDB is to store data in different databases over the network instead of keeping it in a single database, while keeping the data accessible to different users in different places [9]. Users access the data by means of queries [2]. The processing of a distributed query consists of the following phases [9]:
· Local processing phase.
· Reduction phase.
· Final processing phase.
A distributed database has several benefits, including [2]:
1- Improved performance.
3- Availability and reliability.
4- Reduced communication overhead.
5- Easier system expansion.
A distributed database management system (DDBMS) supports the creation and maintenance of distributed databases, where data is kept at various sites connected through a network. One objective of a DDBMS is to present a simple and unified interface to the users, so that they can access the databases as if there were a single database. Another important objective of a DDBMS is to process distributed queries efficiently, in addition to providing availability and reliability [3].
1.2 Query Processing
Query processing is the activity of converting a high-level query into a low-level form. Most queries submitted to the DBMS are written in a high-level language such as SQL; in the parsing and translation stage, the human-readable form is converted into the form used by the DBMS, which includes a relational algebra expression, a query tree, and a query graph [5]. The query processing methods for multiple dimensions are divided into the five steps below [7].
1. Selection Query Model.
2. Data access model.
4. Query and Data uncertainty.
5. Ranking Function.
The transformation of a high-level query into a low-level query by query processing goes through various stages, as below [5, 7]:
A. Parsing and Translation: In this step, a query submitted to the DBMS in a high-level query language such as SQL, where it is represented as a string or sequence of characters, is converted into a form that the DBMS can use [5, 7].
B. Optimization: In this step, the query processor applies rules to the internal data structure of the query to transform it into an equivalent but more efficient representation [5, 7].
Figure 1 illustrates the steps of processing a high-level query.
Fig. 1 Steps of processing a high-level query [6]
C. Evaluation: This is the last step of processing a query. In this step, the best candidate evaluation plan generated by the optimization engine is selected and then executed [5, 7].
The figure below illustrates the steps of query processing in a database [8].
Fig. 2 Query processing in a database [8].
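The stages above can be observed on a small scale with an embedded engine. The sketch below uses Python's standard sqlite3 module as a stand-in for an RDBMS (an assumption; the paper does not name an engine): the SQL string is parsed, the optimizer picks a plan, which `EXPLAIN QUERY PLAN` reports, and the plan is then evaluated. The `balance` table follows the paper's later example; the `account_id` column, the index name, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balance (account_id INTEGER PRIMARY KEY, salary INTEGER)")
conn.executemany(
    "INSERT INTO balance (salary) VALUES (?)",
    [(s,) for s in (1500, 2800, 3200, 5000)],
)

query = "SELECT salary FROM balance WHERE salary < 3000"

def plan_details(connection, sql):
    """Return the human-readable 'detail' column of SQLite's query plan."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]

# Without an index, the only available access path is a full table scan.
plan_before = plan_details(conn, query)

# After an index is created, the optimizer can choose a different plan
# for exactly the same SQL text.
conn.execute("CREATE INDEX idx_balance_salary ON balance (salary)")
plan_after = plan_details(conn, query)

# Evaluation: the chosen plan is executed and the rows are returned.
result = sorted(row[0] for row in conn.execute(query))

print(plan_before)
print(plan_after)
print(result)
```

The exact wording of the plan text differs between SQLite versions, so only the presence of the index name in the second plan should be relied on.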
Query processing is an important concern in the area of distributed databases. The task is to determine the sequence of operations of a query and the sites at which to execute them such that the operating cost (communication cost and processing cost) of processing the query is minimized. Query processing depends not only on the operations of the query but also on the parameter values associated with the query. Distributed query processing has an important impact on the performance of a distributed database system [3].
1.3 Query Optimization
Query optimization is responsible for returning the result in a timely manner by choosing an effective execution plan: it looks for a plan that decreases the overall execution cost of a query. The process of choosing the lowest-cost plan is known as cost-based optimization, and there are two other strategies for reducing the execution cost of a query [5]:
1. Heuristic-based optimization.
2. Semantic-based optimization.
Query optimization also has three principles [6]:
1. QEP Generation:
It characterizes the source language, the target language, and the transformations between them, that is, how a plan in the target language is built from the original query. The target language reflects the run-time aspects that apply when the query evaluation plan (QEP) is evaluated.
Example: “Physical representation of hash tables, an index which determines the usage of varies varieties of access operators. Operators implementing various join methods and index the QEP usage” (Dr.K. Kiran Kumar, T.M. Santhi Sri , Voruganti Vamshi priya, 2015).
2. Search strategy:
A user-submitted query can be evaluated by several different QEPs; the search strategy generates these alternatives in order to find a suitable candidate.
3. Cost Function:
The cost function is used to compare the different QEPs and to find the best one for producing the result.
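A cost function of this kind can be sketched as a weighted sum over estimated resource usage, with the optimizer keeping the cheapest candidate. The plan names, the row counts, and the weights below are hypothetical; a real optimizer derives such estimates from catalog statistics.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    name: str
    rows_read: int     # estimated rows read from storage
    rows_shipped: int  # estimated rows sent over the network

def cost(plan, io_weight=1.0, network_weight=10.0):
    """Assumed cost model: network transfer is weighted more heavily than I/O."""
    return io_weight * plan.rows_read + network_weight * plan.rows_shipped

def choose_plan(candidates):
    """Search strategy reduced to its simplest form: take the cheapest plan."""
    return min(candidates, key=cost)

# Hypothetical candidate plans for the same query.
candidates = [
    Plan("ship whole table, filter at client", rows_read=10_000, rows_shipped=10_000),
    Plan("filter at server, ship matches", rows_read=10_000, rows_shipped=500),
    Plan("index lookup at server, ship matches", rows_read=500, rows_shipped=500),
]

best = choose_plan(candidates)
print(best.name, cost(best))
```

Enumerating every candidate is only feasible for tiny plan spaces; this is why the search strategy is listed as a separate principle from the cost function.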
Example (Dr.K. Kiran Kumar, T.M. Santhi Sri, Voruganti Vamshi priya, 2015):
where salary < 3000

This is translated into the following equivalent relational algebra expressions:

σ salary<3000 (π salary (balance))
π salary (σ salary<3000 (balance))

They can also be represented as the following trees:

σ salary<3000        π salary
      |                  |
  π salary         σ salary<3000
      |                  |
   balance            balance

Query optimization is a difficult task in a distributed database because data location becomes a major factor. To optimize queries well, adequate information must be available to determine which data access techniques are most effective, for instance table and column cardinality, organization information, and index availability. Optimization algorithms have an important effect on the performance of distributed query processing [3].

2. Literature Review

In query processing, users of the database mostly specify what data they want rather than the procedure for retrieving it; therefore, the most important part of query processing is query optimization, which is responsible for finding the best way to execute queries in the database [2]. Additionally, both query processing and optimization have a significant impact on the performance of the distributed database (DDB). Studies have explored many methods for optimizing queries and improving the execution of the distributed database.

One study found that the join query can be optimized in a distributed database by comparing two methods. The first method transmits the data from the server to the client, inserts the data into the client DB, and then executes the join query. The second method executes the join query on the client immediately after fetching the data from the server site, without adding the data to the client DB, so the time needed to insert the data into the client DB is saved. Consequently, the second method optimizes the join query in a DDB (Pawandeep Kaur, 2013) [1].
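The two methods compared in that study can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the study's implementation: the client DB is an in-memory SQLite database, the rows stand for data already fetched from the two servers, and the table layout, function names, and sample values are hypothetical.

```python
import sqlite3

# Rows assumed to have been fetched from server 1 and server 2.
supply = [(1, "Delhi", "Mumbai"), (2, "Pune", "Agra")]
supplier = [(1, "Acme", "12 Main St"), (3, "Zed", "9 Low Rd")]

def join_via_insert(supply_rows, supplier_rows):
    """Method 1: insert the fetched rows into the client DB, then join there."""
    client = sqlite3.connect(":memory:")
    client.execute("CREATE TABLE SUPPLY (SUPPLY_NO, FROM_PLACE, TO_PLACE)")
    client.execute("CREATE TABLE SUPPLIER (SUPPLY_NO, S_NAME, S_ADDRESS)")
    client.executemany("INSERT INTO SUPPLY VALUES (?, ?, ?)", supply_rows)
    client.executemany("INSERT INTO SUPPLIER VALUES (?, ?, ?)", supplier_rows)
    rows = client.execute(
        "SELECT * FROM SUPPLY s, SUPPLIER sr WHERE s.SUPPLY_NO = sr.SUPPLY_NO"
    ).fetchall()
    client.close()
    return rows

def join_without_insert(supply_rows, supplier_rows):
    """Method 2: join the fetched rows directly in memory (a hash join),
    skipping the insertion into the client DB."""
    by_supply_no = {}
    for row in supplier_rows:
        by_supply_no.setdefault(row[0], []).append(row)
    return [s + sr for s in supply_rows for sr in by_supply_no.get(s[0], [])]

print(join_via_insert(supply, supplier))
print(join_without_insert(supply, supplier))
```

Both functions return the same joined rows; the difference the study measures is the insertion time that the second method avoids.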
Another study (Monjurul Alom, Frace Henskens and Michael Hannaford, 2009) [3] explained, on the basis of join and semi-join strategies, different methods such as the Fragmentation and Replication Strategy (FRS) and the Partition and Replicate Strategy (PRS) for processing a query when the relations referenced by the query are not fragmented but are distributed over various sites. This technique determines which relation is to be split into fragments and where the fragments are sent for processing. It processes the query by exploiting parallelism and decreasing the amount of data transported between sites, and it gives a better query processing cost when a specific query references one relation or all relations at the various sites that hold the attributes of the query. The researchers in this study were mainly concerned with how to "fragment more than one referenced non fragmented relations as FRS is not applicable to processing distributed queries in which all of the relations which are non-fragmented but referenced by a query" (Monjurul Alom, Frace Henskens and Michael Hannaford, 2009) [3]. They explain these strategies on the basis of six definitions (D1-D6) and also describe distributed query optimization issues.

Moreover, (Abhijeet Raipurkar and G. R. Bamnote, 2013) [9] used two methods, a semi-join based query optimization algorithm and the SDD-1 algorithm, to improve the performance of the distributed database. These optimization algorithms play an effective role in the performance of distributed query processing, including reducing the response time of the query and the cost of communication (Abhijeet Raipurkar, G. R. Bamnote, 2013) [9].
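The idea behind semi-join based optimization can be sketched as follows: instead of shipping a whole relation to the other site, only the join-column values are shipped first, the remote relation is reduced to its matching rows, and only those rows travel back. The relations `r_rows` and `s_rows`, the column name `k`, and the simplified unit of cost (one shipped value or row counts as one unit) are assumptions for illustration, not the cited algorithms themselves.

```python
def ship_whole_strategy(r_rows, s_rows, column):
    """Ship all of S to R's site, then join there."""
    joined = [{**r, **s} for r in r_rows for s in s_rows if r[column] == s[column]]
    shipped = len(s_rows)
    return joined, shipped

def semi_join_strategy(r_rows, s_rows, column):
    """Reduce S with a semi-join before shipping it."""
    # Step 1: R's site ships only its distinct join-column values.
    join_values = {r[column] for r in r_rows}
    # Step 2: S's site keeps only the rows that can match (the semi-join).
    s_reduced = [s for s in s_rows if s[column] in join_values]
    # Step 3: the reduced S is shipped back and joined with R.
    joined = [{**r, **s} for r in r_rows for s in s_reduced if r[column] == s[column]]
    shipped = len(join_values) + len(s_reduced)
    return joined, shipped

# Hypothetical data: R is small, S is large and mostly irrelevant to the join.
r_rows = [{"k": 1, "a": "x"}, {"k": 2, "a": "y"}]
s_rows = [{"k": i, "b": i * 10} for i in range(100)]

whole_result, whole_shipped = ship_whole_strategy(r_rows, s_rows, "k")
semi_result, semi_shipped = semi_join_strategy(r_rows, s_rows, "k")
print(whole_shipped, semi_shipped)
```

The semi-join pays off only when the reduction is selective; if most rows of S match, shipping the join values first adds cost without shrinking S.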
The impact of query processing and optimization on the distributed database has been discussed for many years. An article published in IEEE Transactions on Software Engineering in 1979, and uploaded by Alan Hevner in 2015 [4], explained Algorithm G for query processing, which is an integral part of a distributed database management system. The algorithm derives a strategy for a distributed query in a two-step process:
· Algorithm PARALLEL for response time.
· Ordered SERIAL strategy for total time.
These steps provide distribution strategies with minimal response time and minimal total time for queries [4].

All of these studies explained algorithms that improve the performance of the DDB; however, none of them established which algorithm is the best among them, that is, which one has the minimal response time and the minimal transmission cost.

3. Query Algorithms

Queries are ultimately reduced to a number of data scan operations on the underlying physical record structures. For each relational operation, there can be several different access paths to the particular records needed. The query execution engine can have a large number of specialized algorithms, each designed to process a particular combination of relational operation and access path. Two types of algorithms are described below [5, 7]:

3.1 Selection Algorithms

The select operation must search the data files for records that meet the selection criteria.

3.2 Join Algorithms

Like selection, the join operation can be implemented in a variety of ways. In terms of disk accesses, join operations can be very expensive, so implementing and using efficient join algorithms is critical in minimizing a query's execution time.

4. Some Query Processing and Optimization Methods in DDB

There are various methods for processing and optimizing queries in a database. These methods improve the performance of the query and decrease its cost; the optimizer decides the order in which the operations of a query request (such as joins, selects, and projects) should be executed [1]. These methods are responsible for returning data in minimal time and with minimal transmission cost. The primary operations used to extract the wanted information from one, two, or multiple tables are the join and the semi-join. Different measurements are used to assess the performance of join and semi-join in a distributed database system (DDBS), such as query cost, memory used, CPU cost, input/output cost, sort operations, data transmission, total time, and response time. The join is the most important operation in a database for bringing together data from two or more tables [2]. The following algorithms improve the performance of the distributed database:

4.1 First Algorithm: The parallel query processing method, Join Query Optimization [1]

This algorithm focuses on maximizing the number of simultaneous transmissions rather than minimizing the amount of data transmitted.

Fig. 3 The parallel processing of a join query [1]

Example (taken from [1]): The client requests data from server 1 and server 2 by means of queries. Server 1 then sends the SUPPLY data and server 2 sends the SUPPLIER data to the client. The client inserts the data into its database and performs the join query on the data from the two servers. Suppose server 1 contains the SUPPLY relation

SUPPLY(SUPPLY_NO, FROM_PLACE, TO_PLACE)

and server 2 contains the SUPPLIER relation

SUPPLIER(SUPPLY_NO, S_NAME, S_ADDRESS)

and the client wants the join of the SUPPLY and SUPPLIER relations from server 1 and server 2 respectively, and so wants to perform the query Q.
Q: SELECT * FROM SUPPLY s, SUPPLIER sr WHERE s.SUPPLY_NO = sr.SUPPLY_NO

In distributed databases, query Q can be divided into three parts:
1. SELECT * FROM SUPPLY
2. SELECT * FROM SUPPLIER
3. SELECT * FROM SUPPLY s, SUPPLIER sr WHERE s.SUPPLY_NO = sr.SUPPLY_NO

Queries 1 and 2 select the data from the two source tables. Because each of them runs on the remote machine where its data resides, executing these two queries does not require data transmission. Query 3 is the join query, which cannot be executed until the data on the remote sites has been transferred to the same site.

Several factors matter when optimizing the execution of a distributed join query that accesses data from distant sites:
· Size of transmitted data: the amount of data that has to be transmitted; it should be small so that transmission takes less time.
· Transmission speed: this factor depends on the network speed.
· Local processing costs: these include CPU cost and I/O cost and can differ with the processing speed of the machine. They should be small, and the operations should be executed efficiently, in order to raise the performance of the join query.

The parallel processing of a join query in a distributed database depends on the following factors:
· Time for transmitting data: if the quantity of transmitted data increases, the transmission time also increases; it depends on the network speed between source and destination.
· Time for inserting data: two different operations can be used to insert the server's data into the client database, and the time taken depends on which is used:
1. Row-by-row insertion.
2. Bulk insertion.
· Time for join execution: the execution time of the join query on the client side, which joins attributes from server A and server B.

The optimization of a join query can be performed by using:
1. Various join orders.
2. Alternative "where" clauses that give the same result.
3. Various join methods.
4. The local processing cost of the query, such as CPU cost and I/O cost.

To increase the performance of the join query, the cost of transmission should be low; the transmission cost and the insertion cost are the most important costs when a machine needs the result of a join over data from different machines. Moreover, inserting the server's data into the client database takes more time than transmitting the data. The important factor in improving the performance of the join query is therefore to bring in the data from the different sources using insertion functions that take less time. Consequently, in this method the time for inserting data into the client DB is eliminated, and the join query is thereby optimized in distributed databases [1].

4.2 Second Algorithm: SDD-1 [2]

This algorithm uses semi-join operations to process the joins between relations and to reduce them. The SDD-1 algorithm has three important characteristics:
1. It uses the semi-join operation as its processing strategy.
2. The relations at all sites are neither replicated nor fragmented.
3. In the cost estimate of the whole algorithm, the cost of transmitting the result to the starting site is not counted.

The SDD-1 algorithm consists of two parts:
1. The basic algorithm.
2. The post-optimization.

The SDD-1 algorithm does not make full use of the characteristics of a distributed database system: all of the semi-join operations run sequentially, which increases the response time of the query to a certain extent.

4.3 Third Algorithm: Algorithm G [4]

The characteristic that makes query optimization with this algorithm efficient is that data transmissions are quickly eliminated from consideration if they cannot be part of a minimal-time schedule, so the number of transmissions considered is smaller than the number of all possible data transmissions. The algorithm can be run with either the minimization of response time or the minimization of total time as its cost objective. Algorithm G has five steps:
1. (Initialization.) After initial local processing, order the relations so that s1 < s2 < ... < sm. For each joining domain di of each relation Ri.
2. Repeat Step 3 for each Ri, i = 1, ..., m, then go to Step 4.
3. Build candidate schedules for Ri.
4. Integrate schedules.
5. Build strategy.

For response time, all possible parallel data transmissions for the common joining domain are included and contribute to the selectivity of the transmission under consideration. For total time, these parallel transmissions are not included. Considering parallel transmissions in order to minimize the response time of a schedule raises the complexity of Algorithm G considerably, while the resulting decrease in schedule response time is limited. The algorithm derives a strategy for a distributed query in a two-step process:
· Algorithm PARALLEL for response time.
· Ordered SERIAL strategy for total time.
Consequently, the analysis of Algorithm G is carried out only for minimizing total time.

Conclusion

One of the most critical functional requirements of a database system is its ability to process queries in a timely manner, which is the responsibility of query processing and optimization. Query processing and optimization in a distributed system require the transmission of data between computers in a network. The arrangement of data transmissions and local data processing is known as a distribution strategy for a query. Two cost measures, response time and total time, are used to judge the quality of a distribution strategy, and many algorithms are used to measure the performance of the DDB and to show the impact of query processing and optimization on the distributed database.