Citation Information : International Journal of Advanced Network, Monitoring and Controls. Volume 3, Issue 2, Pages 80-83, DOI: https://doi.org/10.21307/ijanmc-2018-034
License : (CC-BY-NC-ND-4.0)
Published Online: 07-May-2018
With the rapid development and widespread application of the Internet, the problem of storing massive image data has become prominent, and the low management efficiency, limited storage capacity and high cost of traditional storage frameworks have become apparent. The emergence of Hadoop offers a new approach. However, Hadoop itself is not well suited to handling small files. This paper proposes a storage framework for massive image files based on Hadoop. It relieves the NameNode memory bottleneck that arises when there are too many small files, by means of a classification algorithm in the preprocessing module and the introduction of an efficient first-level index mechanism. Tests show that the system is safe, easy to maintain and highly scalable, and that it achieves good results.
With the rapid development and widespread application of the Internet, more and more large portal sites, e-commerce websites and online communities have appeared. These websites hold large collections of pictures, and the number of pictures is growing rapidly. Traditional technical frameworks cannot cope effectively with massive numbers of images. How to build a cheap and efficient picture storage management system while still supporting highly concurrent access has become a problem to be solved.
Through an analysis of Hadoop's HDFS, a study of MapReduce and an analysis of the business requirements of picture storage, this paper proposes a massive image storage model based on Hadoop. The system has a Master/Slave architecture and is based on HDFS, the distributed file system of Hadoop. Its hardware consists of clusters of ordinary Linux machines. Internal monitoring provides high fault tolerance, fast response and load balancing, while the system exposes its services externally; it can also support highly concurrent and dependable applications.
The system adopts an HA (high availability) framework and supports smooth expansion, which ensures the availability and scalability of the whole file system. In addition, the system uses a flat data organization structure with an efficient, extensible first-level index mechanism, so that the mapping between a picture, the sequence file that contains it and its offset within that file can be located quickly. The storage balance and stability of each storage node are achieved through load balancing and the design of the cache system. Taking into account the heterogeneous, distributed and diverse nature of massive image data, as well as how the system is organized, the system employs a three-layer MVC design. As a result, the structure is clearer and the system is easier to extend. The overall platform consists of three layers: the data resource layer, the business logic layer and the application interface layer. The data resource layer is the basis of the whole platform and the fundamental part of cloud storage. The business logic layer processes massive image data in parallel and manages the configuration of the entire platform. It is the most important part of the cloud storage and also the most technically demanding. Its main functions are to coordinate the multiple storage devices beneath the business logic layer, to provide a uniform API to upper-layer application services and to hide the underlying storage devices from them.
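The paper does not give the layout of the first-level index. As a minimal sketch, assuming the index is a single flat map from image name to a (sequence file, offset, length) triple, a lookup might look like the following. The class names, field names and file paths are hypothetical and serve only as illustration.

```python
from typing import Dict, NamedTuple, Optional


class IndexEntry(NamedTuple):
    sequence_file: str  # container file that holds the image (hypothetical path)
    offset: int         # byte offset of the image inside that container
    length: int         # image size in bytes


class FlatImageIndex:
    """Hypothetical one-level index: image name -> location in a container."""

    def __init__(self) -> None:
        self._entries: Dict[str, IndexEntry] = {}

    def put(self, image_name: str, sequence_file: str,
            offset: int, length: int) -> None:
        self._entries[image_name] = IndexEntry(sequence_file, offset, length)

    def locate(self, image_name: str) -> Optional[IndexEntry]:
        # One dictionary lookup; no directory hierarchy is traversed.
        return self._entries.get(image_name)


index = FlatImageIndex()
index.put("user42_avatar.jpg", "/images/seq/part-0001.seq", 0, 20480)
index.put("user42_banner.jpg", "/images/seq/part-0001.seq", 20480, 512000)

entry = index.locate("user42_banner.jpg")
```

With a flat map of this kind, locating a picture costs one lookup regardless of how many pictures are stored, which is one plausible reading of the "flat data organization" described above.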
In terms of function, the overall module structure of the system is shown in Figure 1.
At the bottom is the data storage layer, which consists mainly of a storage module. For massive data storage, the storage module must hide the differences between data sources provided by different databases and offer a uniform database access service, so that the system can meet the requirements of massive data storage. As a result, the system is more extensible and complete, and is easier to manage and deploy. HDFS combines a large number of low-cost machines into a cluster that provides massive storage capacity.
The business logic layer is the core of the whole system and the key element of its design and development. It uses distributed database technology and Linux cluster technology to provide the main function of loading and storing massive data in parallel. Data processed in parallel in this layer is stored in the system's distributed database, and the layer also provides management support services to keep the system running normally. This layer consists of five functional modules.
The image file preprocessing module handles image file preprocessing, file name design and picture metadata management. Using a classification algorithm, the preprocessing module merges strongly correlated files into a SequenceFile, which greatly reduces the number of files in HDFS. The design of the image file name makes images easy to read back.
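The classification algorithm itself is not specified in the paper. The sketch below illustrates the general idea of merging related small files into one container under a loudly stated assumption: images are treated as related when they share a file name prefix (the text before the first underscore). An in-memory buffer stands in for a Hadoop SequenceFile, and all function names are hypothetical.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def classify(image_name: str) -> str:
    # Assumption: images sharing a name prefix (text before "_") are related.
    return image_name.split("_", 1)[0]


def pack_images(images: Iterable[Tuple[str, bytes]]):
    """Append each image to its group's container and record its position."""
    containers: Dict[str, io.BytesIO] = defaultdict(io.BytesIO)
    index: Dict[str, Tuple[str, int, int]] = {}
    for name, data in images:
        group = classify(name)
        buf = containers[group]
        offset = buf.tell()  # current end of the container
        buf.write(data)
        index[name] = (group, offset, len(data))
    return containers, index


def read_image(containers, index, name: str) -> bytes:
    # Slice the image back out of its container using the recorded position.
    group, offset, length = index[name]
    return containers[group].getvalue()[offset:offset + length]


small_files = [
    ("albumA_001.jpg", b"\xff\xd8first"),
    ("albumA_002.jpg", b"\xff\xd8second"),
    ("albumB_001.jpg", b"\xff\xd8third"),
]
containers, index = pack_images(small_files)
```

Here three small files end up in two containers, so the file system tracks two objects instead of three; at scale, this reduction is what relieves the pressure on NameNode memory.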
The main function of the storage control module is to provide a web interface and a unified set of commands for storage management, so that the nodes of the storage layer can be managed through the web interface. Hadoop exposes port 50070 for viewing NameNode information. Through this interface you can view the live nodes, the failed nodes and the NameNode logs.
The main purpose of the caching service module is to build a buffer. Client requests are first screened by the cache buffer; we use Redis to build the caching system. Redis is similar to the traditional Memcached in that both are key/value storage systems. Redis, however, supports more value types, including String, List, Set and zset (sorted set). These types support push/pop, add/remove, intersection, union and difference, as well as richer operations, and all of these operations are atomic.
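How requests are screened by the cache is not detailed in the paper. One common pattern that fits the description is cache-aside: look in the cache first and fall back to storage only on a miss. The sketch below uses a dict-backed stand-in rather than a real Redis client, so it runs without a server; the class, the key format and the metadata payload are all assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple


class InMemoryCache:
    """Dict-backed stand-in for a key/value cache such as Redis."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


def read_with_cache(cache: InMemoryCache, key: str,
                    load_from_storage: Callable[[str], bytes]) -> Tuple[bytes, bool]:
    """Return (value, was_cache_hit); populate the cache on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached, True
    value = load_from_storage(key)
    cache.set(key, value)
    return value, False


storage_reads = []  # records every call that reaches the backing store


def load_metadata(key: str) -> bytes:
    # Hypothetical metadata record; the real format is not given in the paper.
    storage_reads.append(key)
    return b'{"file": "part-0001.seq", "offset": 20480}'


cache = InMemoryCache()
first, hit_first = read_with_cache(cache, "meta:user42_banner.jpg", load_metadata)
second, hit_second = read_with_cache(cache, "meta:user42_banner.jpg", load_metadata)
```

The second request is answered from the cache, so the backing store is consulted only once for the same key.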
The business processing module mainly handles image processing tasks. Pictures uploaded to Internet applications usually need processing, such as generating thumbnails for uploaded pictures or applying image segmentation to uploaded portraits. When pictures are uploaded in bulk, concentrating picture processing on the application server consumes that server's CPU resources and degrades the quality of the application service. The traditional approach is to dedicate a separate machine to image processing, which decouples the processing from the application. With the emergence of MapReduce, however, pictures can be processed with MapReduce instead. Image processing is distributed across the cheap machines that store the images, and the results are stored directly after processing, which not only saves hardware resources but also spreads the processing load across the nodes.
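The paper does not show its MapReduce job. As a rough illustration of the map side of thumbnail generation, the sketch below treats each record as a (name, pixels) pair and emits a (thumbnail name, thumbnail pixels) pair. It runs locally in plain Python rather than on Hadoop, uses nearest-neighbour sampling as an assumed resizing method, and represents an image as a list of rows of grayscale values; all of these choices are assumptions.

```python
from typing import Iterator, List, Tuple

Pixels = List[List[int]]  # grayscale image as rows of pixel values


def make_thumbnail(pixels: Pixels, max_side: int) -> Pixels:
    """Nearest-neighbour downsample so the longer side is at most max_side."""
    height, width = len(pixels), len(pixels[0])
    scale = max(height, width) / max_side
    if scale <= 1:
        return [row[:] for row in pixels]  # already small enough
    new_h = max(1, int(height / scale))
    new_w = max(1, int(width / scale))
    thumbnail = []
    for y in range(new_h):
        src_y = min(height - 1, int(y * scale))
        row = [pixels[src_y][min(width - 1, int(x * scale))]
               for x in range(new_w)]
        thumbnail.append(row)
    return thumbnail


def map_thumbnail(record: Tuple[str, Pixels]) -> Iterator[Tuple[str, Pixels]]:
    """Map step: one input image in, one thumbnail record out."""
    name, pixels = record
    yield f"thumb_{name}", make_thumbnail(pixels, max_side=2)


# A 4x4 test image whose pixel value encodes its position (10 * row + column).
image = [[x + 10 * y for x in range(4)] for y in range(4)]
records = [("photo.png", image)]
results = [out for rec in records for out in map_thumbnail(rec)]
```

Because each record is processed independently, the map step can run on whichever node already stores the image, which is the load-spreading property the paragraph above describes.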
The load balancing module mainly ensures the stability of the system under highly concurrent use. We deploy load balancing for the entire system, using HAProxy with its round-robin load balancing algorithm to distribute the load of front-end user requests across the web image servers. HAProxy provides high availability, load balancing and proxying for TCP and HTTP-based applications. It is a free, fast and reliable solution. Requests are sent to different servers depending on whether they are reads or writes. Picture read requests are sent to the image server, which reads the picture metadata through the cache and retrieves the image by way of the NameNode.
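The routing policy described above (separate reads from writes, then rotate through the servers in each pool) can be illustrated with a few lines of Python. This is only a sketch of the policy, not of how HAProxy implements it. The server names are hypothetical, and treating GET and HEAD as reads and everything else as writes is an assumption.

```python
from itertools import cycle
from typing import Iterator, List


class RoundRobinRouter:
    """Send reads and writes to separate pools, rotating within each pool."""

    def __init__(self, read_servers: List[str], write_servers: List[str]) -> None:
        self._pools = {
            "read": cycle(read_servers),
            "write": cycle(write_servers),
        }

    def route(self, method: str) -> str:
        # Assumption: GET/HEAD are reads; any other HTTP method is a write.
        kind = "read" if method.upper() in ("GET", "HEAD") else "write"
        return next(self._pools[kind])


router = RoundRobinRouter(["img-1", "img-2", "img-3"], ["upload-1", "upload-2"])
reads = [router.route("GET") for _ in range(4)]
writes = [router.route("POST") for _ in range(3)]
```

Each pool keeps its own rotation, so a burst of uploads does not disturb the order in which read requests are spread across the image servers.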
The application interface layer is made up of a user-oriented GUI module and an API module for programmatic access. The GUI module provides users with a variety of application tools, so that they have easy access to the massive data storage and processing functions. For advanced users with more demanding requirements, the API module can be used to develop custom applications on top of the storage system and so implement the required functions.
Operating system: Ubuntu 9.04
Distributed file system: Hadoop 0.20.2. Hadoop requires a JDK to run, and this paper uses JDK 1.6.0_31. The cluster comprises one NameNode server, one JobTracker server and four DataNode servers.
Image server: Nginx-0.9.6
Caching software: Redis
Load balancing software: HAProxy
Java environment: JRE6
Java development tool: Eclipse 3.2
The main steps for installing Hadoop are as follows: first configure the hosts file, create the new users and directories, install the JDK and configure the environment variables; then configure passwordless SSH login, install Hadoop and configure it.
We first used concurrent threads on the application server to perform “store and fetch” operations. The pictures were chosen at random, with sizes between 2 KB and 10 MB; the results are shown in Figure 2.
In the storage system built on Hadoop, read TPS (transactions per second) increases with the number of threads, but the growth gradually slows; the first peak is reached at about 70 threads, and adding further read threads no longer produces stable growth in TPS. The traditional NFS storage system requires multiple IO operations to read a picture, so it reaches its load limit when the number of threads reaches 40. The new Hadoop-based storage system reduces the number of IO operations and so, under the same machine configuration, sustains more concurrent read operations than NFS.
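The test harness used for these measurements is not described. A minimal sketch of how TPS could be measured at different thread counts is shown below, assuming TPS is computed as completed operations divided by wall-clock time. The `fake_read` function is a placeholder that merely sleeps; it does not read from HDFS or NFS, so the numbers it produces say nothing about either system.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def measure_tps(operation: Callable[[], None], threads: int,
                ops_per_thread: int) -> float:
    """Run the operation from several threads and return operations per second."""

    def worker() -> int:
        for _ in range(ops_per_thread):
            operation()
        return ops_per_thread

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(worker) for _ in range(threads)]
        completed = sum(f.result() for f in futures)
    elapsed = time.perf_counter() - start
    return completed / elapsed


def fake_read() -> None:
    # Placeholder for one image read; a real test would call the storage system.
    time.sleep(0.001)


# Repeat the measurement at several thread counts to trace a TPS curve.
tps_by_threads = {n: measure_tps(fake_read, n, 20) for n in (1, 4, 8)}
```

Plotting `tps_by_threads` against the thread count would give a curve of the same kind as the one discussed above, with the peak marking the point where extra threads stop helping.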
We then tested random picture writes. The picture sizes again ranged from 2 KB to 10 MB, covering both small and large pictures. The client issues concurrent requests and HDFS stores the pictures. The results are shown in Figure 3.
Write TPS peaks at 60 threads; adding further write threads no longer produces stable growth in TPS. The write performance of the new Hadoop-based storage system is superior to that of NFS, which ensures the high throughput of the system.
The results of the concurrent multi-threaded read and write tests show that the new Hadoop-based storage system optimizes the storage procedure, reduces the number of IO accesses and improves the load capacity of the system. In short, its overall storage performance is better than that of the NFS system. In particular, under highly concurrent read and write operations, the new HDFS-based storage system maintains the stability of the system and the security of the data. We therefore conclude that the new Hadoop-based storage system is able to solve the problem of massive storage.
This paper designs and develops a massive picture data storage platform based on Hadoop, adopting Linux cluster technology and parallel distributed database technology. It employs a series of picture data management techniques, including the HDFS distributed file system, the Map/Reduce parallel computing model and HBase database technology. In addition, Redis and HAProxy are used to build the cache and the load balancer, which keeps the whole system in a stable and healthy state. Built on a large number of cheap, ordinary computers, the platform can meet the requirements of efficient picture data management. The implementation of the platform shows that the system is highly scalable and easy to maintain, so the technical route and design method adopted by the system are effective and feasible.