BestPeer++: A Peer-to-Peer Based Large-Scale Data Processing Platform
A corporate network is often used to share information among participating companies and to facilitate collaboration in an industry sector where companies share a common interest. It can help the companies reduce their operational costs and increase their revenues. However, inter-company data sharing and processing poses unique challenges to a data management system, including scalability, performance, throughput, and security. In this paper, we present BestPeer++, a system which delivers elastic data sharing services for corporate network applications in the cloud, based on BestPeer, a peer-to-peer (P2P) based data management platform. By integrating cloud computing, database, and P2P technologies into one system, BestPeer++ provides an economical, flexible and scalable platform for corporate network applications and delivers data sharing services to participants based on the widely accepted pay-as-you-go business model. We evaluate BestPeer++ on the Amazon EC2 cloud platform. The benchmarking results show that BestPeer++ outperforms HadoopDB, a recently proposed large-scale data processing system, when both systems are employed to handle typical corporate network workloads. The results also demonstrate that BestPeer++ achieves near linear scalability for throughput with respect to the number of peer nodes.
EXISTING SYSTEM:
Ø A centralized data warehousing solution has some deficiencies in real deployment.
Ø First, the corporate network needs to scale up to support thousands of participants, while the installation of a large-scale centralized data warehouse system entails nontrivial costs, including huge hardware/software investments (a.k.a. total cost of ownership) and high maintenance cost (a.k.a. total cost of operations). In the real world, most companies are not keen to invest heavily in additional information systems until they can clearly see the potential return on investment (ROI).
Ø Second, companies want to fully customize the access control policy to determine which business partners can see which part of their shared data.
DISADVANTAGES OF EXISTING SYSTEM:
Ø Most data warehouse solutions fail to offer such flexibility.
Ø Existing solutions have not been designed to handle the dynamicity of a corporate network.
PROPOSED SYSTEM:
Ø The main contribution of this paper is the design of the BestPeer++ system, which provides economical, flexible and scalable solutions for corporate network applications. We demonstrate the efficiency of BestPeer++ by benchmarking it against HadoopDB, a recently proposed large-scale data processing system, over a set of queries designed for data sharing applications. The results show that for simple, low-overhead queries, the performance of BestPeer++ is significantly better than that of HadoopDB.
Ø We discuss the unique challenges posed by sharing and processing data in an inter-business environment and propose BestPeer++, a system which delivers elastic data sharing services by integrating cloud computing, database, and peer-to-peer technologies.
ADVANTAGES OF PROPOSED SYSTEM:
Ø Our system can efficiently handle typical workloads in a corporate network and can deliver near linear query throughput as the number of normal peers grows.
Ø BestPeer++ adopts the pay-as-you-go business model popularized by cloud computing. The total cost of ownership is therefore substantially reduced, since companies do not have to buy any hardware/software in advance. Instead, they pay for what they use in terms of BestPeer++ instance hours and storage capacity.
Ø BestPeer++ extends role-based access control to the inherently distributed environment of corporate networks.
Ø BestPeer++ employs P2P technology to retrieve data between business partners.
Ø BestPeer++ is a promising solution for efficient data sharing within corporate networks.
MODULES:
1. Peer++ Processing Approach
2. Parallel P2P Processing
3. Implementing MapReduce
4. Adaptive Query Processing
Peer++ Processing Approach:
BestPeer++ employs two query processing approaches: basic processing and adaptive processing. The basic query processing strategy is similar to the one adopted in distributed databases. A query submitted to a normal peer P is evaluated in two steps: fetching and processing. In the fetching step, the query is decomposed into a set of subqueries, which are sent to the remote normal peers that host the data involved in the query (the list of these normal peers is determined by searching the indices stored in BATON). Each subquery is then processed by its remote normal peer, and the intermediate results are shuffled to the query-submitting peer P. In the processing step, the normal peer P first collects all the required data from the other participating normal peers. To reduce I/O, the peer P creates a set of MemTables to hold the data retrieved from other peers and bulk inserts these data into the local MySQL database whenever a MemTable is full. After receiving all the necessary data, the peer P finally evaluates the submitted query.
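The MemTable buffering described above can be illustrated with a minimal Java sketch. It is not BestPeer++'s actual code: the `MemTable` class, its `capacity` parameter, and the `bulkInsert` callback (a stand-in for a batched insert into the local MySQL database) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical in-memory buffer: collects rows fetched from remote peers and
// hands them to a bulk-insert callback whenever the buffer fills up.
class MemTable<R> {
    private final int capacity;
    // Assumed stand-in for a batched INSERT into the local database.
    private final Consumer<List<R>> bulkInsert;
    private final List<R> rows = new ArrayList<>();

    MemTable(int capacity, Consumer<List<R>> bulkInsert) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.bulkInsert = bulkInsert;
    }

    // Buffer one fetched row; write the batch out once the table is full.
    void add(R row) {
        rows.add(row);
        if (rows.size() >= capacity) {
            flush();
        }
    }

    // Write out whatever is buffered (call once more after the last fetch).
    void flush() {
        if (rows.isEmpty()) {
            return;
        }
        bulkInsert.accept(new ArrayList<>(rows));
        rows.clear();
    }
}

class MemTableDemo {
    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        MemTable<String> table = new MemTable<>(3, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 7; i++) {
            table.add("row-" + i);
        }
        table.flush();
        System.out.println(batchSizes); // sizes of the batches, in flush order
    }
}
```

Seven rows with a capacity of three result in three bulk inserts instead of seven single-row writes, which is the I/O saving the buffering aims for.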
Parallel P2P Processing:
For each join, instead of forwarding all tuples to a single processing node, we distribute them across a set of nodes, which process the join in parallel. We adopt the conventional replicated join approach: the small table is replicated to all processing nodes, and each node joins it with one partition of the large table.
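A replicated join of this kind might be sketched in Java as follows. This is an illustrative sketch only: the `ReplicatedJoin` class, the round-robin partitioning, the use of a parallel stream to stand in for separate processing nodes, and the assumption that the join key is unique in the small table are not taken from the paper.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ReplicatedJoin {
    // Work done by one processing node: probe its copy of the small table
    // with every row of its partition of the large table.
    static List<String> joinPartition(Map<Integer, String> smallCopy,
                                      List<Map.Entry<Integer, String>> largePartition) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> row : largePartition) {
            String match = smallCopy.get(row.getKey());
            if (match != null) {
                out.add(row.getKey() + "," + row.getValue() + "," + match);
            }
        }
        return out;
    }

    // Split the large table round-robin into one partition per node, give each
    // node its own copy of the small table, join in parallel, and concatenate.
    static List<String> join(Map<Integer, String> small,
                             List<Map.Entry<Integer, String>> large,
                             int nodes) {
        List<List<Map.Entry<Integer, String>>> partitions = new ArrayList<>();
        for (int i = 0; i < nodes; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int i = 0; i < large.size(); i++) {
            partitions.get(i % nodes).add(large.get(i));
        }
        return partitions.parallelStream()
                .map(p -> joinPartition(new HashMap<>(small), p)) // replicate the small table
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Integer, String> partners = new HashMap<>();
        partners.put(1, "Acme");
        partners.put(2, "Globex");
        List<Map.Entry<Integer, String>> orders = new ArrayList<>();
        orders.add(new SimpleEntry<>(1, "order-a"));
        orders.add(new SimpleEntry<>(2, "order-b"));
        orders.add(new SimpleEntry<>(1, "order-c"));
        orders.add(new SimpleEntry<>(3, "order-d")); // no matching partner, dropped by the join
        System.out.println(join(partners, orders, 2));
    }
}
```

Only the small table is copied; each row of the large table is sent to exactly one node, which is why the approach suits joins where one side is much smaller than the other.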
Implementing MapReduce:
The main difference between the MapReduce method and the native P2P method lies in join processing. In the MapReduce method, the symmetric-hash join approach is adopted instead of replicated joins. Each mapper reads its local data and shuffles the intermediate tuples according to the hash value of the join key, so each tuple only needs to be shuffled once per join level. Note that configuring and launching a MapReduce job also incurs a certain overhead, which can be measured at runtime and is a constant value.
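The shuffle step can be sketched in plain Java, without the Hadoop API. The `HashShuffle` class, the `reducerFor` helper, and the representation of a tuple as a `String[]` whose first element is the join key are assumptions for illustration; the sketch shows only how tuples are routed by hash, not the reduce-side join itself.

```java
import java.util.ArrayList;
import java.util.List;

class HashShuffle {
    // Reducer index for a join key. floorMod keeps the index non-negative
    // even when hashCode() is negative.
    static int reducerFor(Object joinKey, int reducers) {
        return Math.floorMod(joinKey.hashCode(), reducers);
    }

    // Map side: route each tuple {joinKey, payload} to exactly one bucket,
    // chosen by the hash of its join key.
    static List<List<String[]>> shuffle(List<String[]> tuples, int reducers) {
        List<List<String[]>> buckets = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String[] tuple : tuples) {
            buckets.get(reducerFor(tuple[0], reducers)).add(tuple);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String[]> tuples = new ArrayList<>();
        tuples.add(new String[] {"partner-1", "orders:order-a"});
        tuples.add(new String[] {"partner-2", "orders:order-b"});
        tuples.add(new String[] {"partner-1", "partners:Acme"});
        List<List<String[]>> buckets = shuffle(tuples, 4);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("reducer " + i + " receives " + buckets.get(i).size() + " tuple(s)");
        }
    }
}
```

Because tuples from both tables with the same join key hash to the same reducer, every matching pair meets at one node after a single shuffle, with no table replicated in full.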
Adaptive Query Processing:
For small jobs, the P2P engine performs better than the MapReduce engine, as it does not incur initialization cost and database join algorithms have been well optimized. However, for large-scale data analytic jobs, the MapReduce engine is more scalable, as it does not incur recursive data replications. Based on cost models for the two engines, we propose our adaptive query processing approach. When a query is submitted, the query planner retrieves the related histogram and index information from the bootstrap node, analyzes the query, and constructs a processing graph for it. The costs of both the P2P engine and the MapReduce engine are then predicted from the histograms and the runtime parameters of the cost models. The query planner compares the costs of the two methods and executes the one with the lower cost.
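The final selection step can be sketched in Java as below. The paper's actual cost models are not given in this description, so the two cost functions here are loudly labelled placeholders: `placeholderP2pCost` grows with the number of join levels (standing in for repeated replication) and `placeholderMapReduceCost` adds a constant startup overhead. The `AdaptivePlanner`, `QueryStats`, and `Engine` names, and every constant, are hypothetical.

```java
import java.util.function.ToDoubleFunction;

class AdaptivePlanner {
    enum Engine { P2P, MAPREDUCE }

    // Hypothetical summary of what a planner might derive from histograms.
    static final class QueryStats {
        final long estimatedTuples;
        final int joinLevels;

        QueryStats(long estimatedTuples, int joinLevels) {
            this.estimatedTuples = estimatedTuples;
            this.joinLevels = joinLevels;
        }
    }

    // PLACEHOLDER cost model, not the paper's: data handled at every join level.
    static double placeholderP2pCost(QueryStats s) {
        return (double) s.estimatedTuples * s.joinLevels;
    }

    // PLACEHOLDER cost model, not the paper's: constant job startup plus one pass.
    static double placeholderMapReduceCost(QueryStats s) {
        return 10_000.0 + s.estimatedTuples;
    }

    // Predict both costs and pick the cheaper engine (ties go to P2P).
    static Engine choose(QueryStats stats,
                         ToDoubleFunction<QueryStats> p2pCost,
                         ToDoubleFunction<QueryStats> mapReduceCost) {
        double p2p = p2pCost.applyAsDouble(stats);
        double mr = mapReduceCost.applyAsDouble(stats);
        return p2p <= mr ? Engine.P2P : Engine.MAPREDUCE;
    }

    public static void main(String[] args) {
        QueryStats small = new QueryStats(1_000, 3);
        QueryStats large = new QueryStats(1_000_000, 3);
        System.out.println(choose(small, AdaptivePlanner::placeholderP2pCost,
                AdaptivePlanner::placeholderMapReduceCost));
        System.out.println(choose(large, AdaptivePlanner::placeholderP2pCost,
                AdaptivePlanner::placeholderMapReduceCost));
    }
}
```

Passing the cost functions in as parameters keeps the selection logic independent of any particular model, so real estimates could replace the placeholders without changing `choose`.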
HARDWARE REQUIREMENTS:
Ø System : Pentium IV 2.4 GHz.
Ø Hard Disk : 40 GB.
Ø Floppy Drive : 1.44 MB.
Ø Monitor : 15" VGA Colour.
Ø Mouse : Logitech.
Ø RAM : 512 MB.
SOFTWARE REQUIREMENTS:
Ø Operating system : Windows XP/7.
Ø Coding Language : Java/J2EE
Ø IDE : NetBeans 7.4
Ø Database : MySQL