In this study we present and analyze different cost models and query planning techniques in database management systems, especially for big data, where fast and efficient query execution is essential.
Analyzing graphs is a fundamental problem in big data analytics, for which DBMS technology does not seem competitive. On the other hand, SQL recursive queries are a fundamental mechanism for analyzing graphs in a DBMS, and their processing and optimization is significantly harder than for traditional SPJ queries. We try our best to cover all of this and give a complete survey on query planning and cost modeling.

1-INTRODUCTION

Efficient and fast querying is one of the most important real-world challenges for modern database systems.
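As a small illustration of the kind of recursive SQL query meant here, the following sketch uses SQLite via Python (purely for illustration; the table and names are made up) to compute all nodes reachable from a start vertex with a recursive common table expression:

```python
import sqlite3

# Toy edge table; a recursive CTE computes all nodes reachable from 'a'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edge VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")])

rows = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edge e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach
""").fetchall()
print(sorted(n for (n,) in rows))  # -> ['a', 'b', 'c', 'd']
```

Evaluating such a query requires repeated joins until a fixpoint is reached, which is why its optimization is harder than for a single SPJ query.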
Fast data is becoming a crucial asset for researchers. In this study we investigate the following problem: how can we generate a high-quality query plan that runs fast, minimizing the response time of the query? We will also discuss and describe cost models for high-dimensional data spaces. Many query planning techniques have been presented, as well as several techniques for cost modeling. We will discuss pattern matching over compressed graphs [1], distributed SQL query execution over multiple engines [2], a comparison of column, row, and array database management systems for query processing, cost models for nearest neighbor search, and real-time processing techniques and cost models for query processing over high-dimensional data [3]. We will discuss different factors in query planning and review the related cost models described by different authors from all over the world.

2-HISTORY OF QUERY PLANNING AND COST MODELS

For data processing, the most popular platforms on the cloud are based on MapReduce, presented by Google.
On top of MapReduce, Google has also built the systems FlumeJava, Tenzing, and Sawzall. FlumeJava is a library used to write data pipelines, which are transformed into MapReduce jobs. Sawzall is a scripting language in which computations over big datasets can be expressed. Tenzing is an analytical query engine that pre-allocates machines to minimize latency. Hadoop, by Yahoo, is the main open-source implementation of MapReduce. Hive is a warehouse solution for Facebook; the query language of Hive (HiveQL) is a subset of SQL, and its optimization techniques are limited to simple transformation rules. The optimization goal is to maximize parallelism, minimize the number of MapReduce jobs, and minimize the execution time of the query.
HadoopDB is a recent hybrid system that combines MapReduce with databases. It uses multiple single-node databases and relies on Hadoop to schedule the jobs for each database. The optimization goal is to create as much parallelism as possible by assigning sub-queries to the single-node databases. Condor/DAGMan/Stork is the state-of-the-art technology of High Performance Computing. Nevertheless, Condor was designed to harvest CPU cycles on idle machines, and running data-intensive workflows with DAGMan is very inefficient. DAGMan is used as middleware in many systems, such as Pegasus and GridDB.
Proposals for extending Condor to deal with data-intensive scientific workflows do exist, but to the best of our knowledge they have not been materialized yet. A case study has been presented of executing the Montage dataflow on the cloud, examining the trade-offs of different dataflow execution modes and provisioning plans for cloud resources. Dryad, by Microsoft, is a commercial middleware with a more general architecture than MapReduce: it can parallelize any dataflow. Its schedule optimization, however, relies heavily on hints requiring knowledge of node proximity, which is generally not available in a cloud environment. It deals with job migration by instantiating another copy of a job rather than moving the job to another machine; this may be acceptable when optimizing solely for time, but it matters when financial cost is considered, since it allocates additional containers. DryadLINQ is built on top of Dryad and uses LINQ, a set of .NET constructs for manipulating data.
LINQ queries are transformed into Dryad graphs and executed in a distributed fashion. One of the first distributed database systems to take the monetary cost of answering queries into consideration was Mariposa: the user provides a budget function and the system optimizes the cost of accessing the individual databases using auctioning.

3-QUERY PLANNING AND COST MODELING TECHNIQUES

In this study we reviewed ten papers covering different techniques for query planning and cost modeling. One solution to improve the performance of particular types of graph operations, proposed by both the data mining and theoretical computer science communities, is to reduce the size of the original graph G [6-7] by turning it into a smaller graph G'. Antonio Maccioni proposed a solution that improves the performance of various graph operations, such as the bounded approximation of the Laplacian matrix or the bounded approximation of graph isomorphisms [1].
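A minimal sketch of this kind of graph-size reduction, following the compressor-node idea behind dedensification in simplified form (function and node names here are illustrative, not the paper's exact algorithm):

```python
from collections import defaultdict

def dedensify(edges, threshold):
    """Replace bundles of edges into high-degree targets with compressor
    nodes: sources sharing the same set of high-degree targets are routed
    through one intermediate node, shrinking the total edge count."""
    indeg = defaultdict(int)
    for _, v in edges:
        indeg[v] += 1
    high = {v for v, d in indeg.items() if d >= threshold}

    kept = [(u, v) for u, v in edges if v not in high]
    high_targets = defaultdict(set)          # source -> its high-degree targets
    for u, v in edges:
        if v in high:
            high_targets[u].add(v)

    groups = defaultdict(list)               # target set -> sources sharing it
    for u, targets in high_targets.items():
        groups[frozenset(targets)].append(u)

    for i, (targets, sources) in enumerate(groups.items()):
        c = f"C{i}"                          # fresh compressor-node label
        kept += [(u, c) for u in sources]
        kept += [(c, v) for v in sorted(targets)]
    return kept

# Four sources each pointing to the same two popular targets: 8 edges become 6.
edges = [(f"s{i}", t) for i in range(4) for t in ("h1", "h2")]
print(len(dedensify(edges, threshold=3)))    # -> 6
```

A query touching a high-degree node then traverses one compressor edge instead of many parallel edges, which is where the reported speedups for high-degree nodes come from.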
They introduce the concept of dedensification and parameterize the algorithm via a threshold, focusing on query execution and experimenting with both indexed and non-indexed records. The experiments show that dedensification improves performance for queries involving high-degree nodes, sometimes by an order of magnitude. Victor Giannakouris and Nikolaos Papailiou worked on distributed query execution over a multiple-engine environment [2]. Their proposed solution, MuSQLE, can efficiently utilize external SQL engines, allowing for both intra- and inter-engine optimizations. The system adopts a novel API-based strategy: instead of manual integration, MuSQLE specifies a generic API, used for cost estimation and query execution, that needs to be implemented for each SQL engine endpoint. Sub-queries are then optimized individually by each engine.
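The API-based integration can be pictured roughly as follows; this is an entirely hypothetical sketch (class and method names are invented, not MuSQLE's actual interface), showing only the shape of a per-engine endpoint that exposes cost estimation and execution:

```python
from abc import ABC, abstractmethod

class SQLEngineEndpoint(ABC):
    """Hypothetical per-engine adapter in the spirit of a generic
    multi-engine API: each engine only has to expose two operations."""

    @abstractmethod
    def estimate_cost(self, sql: str) -> float:
        """Return the engine's own cost estimate for a sub-query."""

    @abstractmethod
    def execute(self, sql: str) -> list:
        """Run the sub-query on this engine and return its rows."""

def plan_subquery(sql: str, engines: list) -> SQLEngineEndpoint:
    # Pick the engine whose optimizer reports the cheapest plan for this part.
    return min(engines, key=lambda e: e.estimate_cost(sql))
```

The design point is that the planner never needs engine-specific integration code: any system that implements the two methods can compete for each query part on estimated cost.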
As a result, MuSQLE can provide speedups of up to one order of magnitude for TPC-H queries, leveraging different engines for the execution of individual query parts. Stefan Berchtold, Christian Böhm, and Daniel A. Keim present their work on cost modeling for nearest neighbor search in high-dimensional data spaces [4]. They first analyze different nearest neighbor algorithms and then present a new cost model for nearest neighbor search in high-dimensional data space. Their model works for an arbitrary number of data points and for data sets with an arbitrary number of dimensions, is applicable to different index structures and data distributions, and provides accurate estimates of the expected query execution time. The results they obtain after applying the model to Hilbert and X-tree indices show that it provides a good estimation of query performance, considerably better than the estimates of previous models, especially for high-dimensional data. In another study, Carlos Ordonez, Wellington Cabrera, and Achyuth Gurram present a solution for recursive graph queries [3], addressing the processing and optimization of SQL recursive queries over graphs. A cost model for index structures for point databases, such as the R*-tree and the X-tree, is presented by Christian Böhm [10]. The BBKK model introduced two techniques, the Minkowski sum and data-space clipping, to estimate the number of page accesses when performing range queries and nearest-neighbor queries in a high-dimensional Euclidean space.
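The Minkowski-sum technique can be sketched numerically: enlarging a page region by the query radius and taking the volume of the enlarged shape gives the probability that a uniformly placed query touches that page. The following is a rough sketch under strong simplifying assumptions (uniform data in the unit hypercube, cube-shaped pages, no boundary clipping beyond a crude cap); it is not the full BBKK model:

```python
from math import comb, gamma, pi

def sphere_volume(j, r):
    """Volume of a j-dimensional ball of radius r (with V_0 = 1)."""
    return pi ** (j / 2) * r ** j / gamma(j / 2 + 1)

def minkowski_sum_volume(d, a, r):
    """Volume of the Minkowski sum of a d-dim cube of side a and a ball of
    radius r: each k-dim face of the cube contributes a rounded slab."""
    return sum(comb(d, k) * a ** k * sphere_volume(d - k, r)
               for k in range(d + 1))

def expected_page_accesses(d, num_pages, page_side, r):
    """Rough estimate for uniform data in the unit hypercube: a page is
    accessed when the query point falls inside the Minkowski enlargement
    of its region, so multiply that (capped) volume by the page count."""
    return num_pages * min(1.0, minkowski_sum_volume(d, page_side, r))

# Square page of side 1 enlarged by a disk of radius 1: 1 + 4 + pi.
print(round(minkowski_sum_volume(2, 1.0, 1.0), 3))  # -> 8.142
```

In high dimensions the spherical terms dominate, so the enlarged volume approaches 1 even for small radii, predicting that almost every page is accessed; this curse-of-dimensionality effect is exactly what such cost models are built to quantify.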