Outlier detection algorithms are widely used in fields such as network intrusion detection and aided medical diagnosis. The classic Local Distance-Based Outlier Factor (LDOF), Cohesiveness-Based Outlier Factor (CBOF) and Local Outlier Factor (LOF) algorithms suffer from long execution times and low detection rates on large-scale and high-dimensional datasets. To address these problems, an outlier detection algorithm Based on Graph Random Walk (BGRW) was proposed. Firstly, the number of iterations, the damping factor and the outlier degree of every object in the dataset were initialized. Then, the transition probability of the random walker between objects was derived from the Euclidean distance between them, and the outlier degree of every object was calculated iteratively. Finally, the objects with the highest outlier degrees were output as outliers. BGRW was compared with LDOF, CBOF and LOF in terms of detection rate, execution time and false positive rate on real datasets from UCI (University of California, Irvine) and on synthetic datasets with complex distributions. The experimental results show that BGRW reduces execution time and false positive rate while achieving a higher detection rate.
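The abstract does not give the BGRW formulas, so the following is only a minimal sketch of a damped random walk over a distance-based graph, not the authors' algorithm. The similarity form `1 / (1 + distance)`, the default damping factor of 0.85, and the definition of outlier degree as one minus the visit probability are all assumptions made for illustration, as is the function name.

```python
import math


def random_walk_outlier_degrees(points, damping=0.85, iterations=50):
    """Return one outlier degree per point (higher means more outlying).

    Assumptions (not from the source): similarity = 1 / (1 + Euclidean
    distance), and outlier degree = 1 - visit probability of the walker.
    Requires at least two points.
    """
    n = len(points)
    # Pairwise similarity: objects that are close get a larger weight.
    similarity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                similarity[i][j] = 1.0 / (1.0 + math.dist(points[i], points[j]))

    # Row-normalise so that each row is a transition probability distribution.
    transition = []
    for row in similarity:
        row_sum = sum(row)
        transition.append([weight / row_sum for weight in row])

    # Start from a uniform visit probability and iterate the damped walk.
    visit = [1.0 / n] * n
    for _ in range(iterations):
        visit = [
            (1.0 - damping) / n
            + damping * sum(visit[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]

    # Objects the walker rarely reaches receive a high outlier degree.
    return [1.0 - probability for probability in visit]
```

Calling the function on a tight cluster plus one distant point should rank the distant point highest; the objects with the largest returned values would then be reported as outliers.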
To address the low server utilization and complicated energy management caused by the random block placement strategy in distributed file systems, a vector of access features was built for each data block to describe the random block access behavior. The K-means algorithm was applied to cluster the data blocks by these feature vectors, and the datanodes were then divided into multiple regions, each storing the data blocks of a different cluster. When the system load was low, the data blocks were dynamically reconfigured according to the clustering results, and unneeded datanodes could sleep to reduce energy consumption. The flexibly configurable distance parameters between clusters make the strategy suitable for scenarios with different requirements for energy consumption and utilization. Compared with hot-cold zoning strategies, mathematical analysis and experimental results show that the proposed method saves energy more efficiently, reducing energy consumption by 35% to 38%.
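As a rough illustration of the clustering step, the sketch below runs plain K-means (Lloyd's algorithm) over per-block access feature vectors. The feature layout (access counts per time window), the choice of the first `k` vectors as initial centroids, and the function name are assumptions for illustration; the abstract does not specify them.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Cluster feature vectors with Lloyd's algorithm.

    Initial centroids are simply the first k vectors (an assumption made
    to keep the sketch deterministic). Returns (labels, centroids).
    """
    centroids = [list(vector) for vector in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each block joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(vector, centroids[c]))
            for vector in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical access counts per block over three time windows.
block_access = [[9, 8, 9], [0, 1, 0], [8, 9, 9], [1, 0, 0]]
labels, centroids = kmeans(block_access, k=2)
```

Blocks that share a label would then be placed in the same datanode region, so that regions holding rarely accessed clusters become candidates for sleeping.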
The emergence of RAMCloud has improved the user experience of Online Data-Intensive (OLDI) applications. However, its energy consumption is higher than that of traditional cloud data centers. To solve this problem, an energy-efficient strategy for disks under this architecture was put forward. Firstly, the fitness function and roulette wheel selection of the genetic algorithm were introduced to choose energy-saving disks for persistent data backup. Secondly, a reasonable buffer size was set to extend the average continuous idle time of the disks, so that some of them could be put into standby while idle. The simulation results show that the proposed strategy saves about 12.69% of energy in a given RAMCloud system with 50 servers. The buffer size affects both the energy-saving effect and data availability, so the two must be weighed against each other.
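Roulette wheel selection picks a candidate with probability proportional to its fitness. The sketch below shows the general technique only; the fitness values for the disks are hypothetical, since the abstract does not define the fitness function, and the function name is illustrative.

```python
import bisect
import itertools
import random


def roulette_select(fitness, rng):
    """Return an index chosen with probability proportional to fitness.

    `fitness` is a list of non-negative numbers with a positive total;
    `rng` is a random.Random instance, passed in for reproducibility.
    """
    cumulative = list(itertools.accumulate(fitness))
    total = cumulative[-1]
    pick = rng.random() * total
    # First position whose cumulative fitness exceeds the pick; entries
    # with zero fitness add nothing to the wheel and are never chosen.
    index = bisect.bisect_right(cumulative, pick)
    # Guard against a floating-point edge case where pick rounds to total.
    return min(index, len(fitness) - 1)


# Hypothetical fitness: higher means the disk is a cheaper backup target.
disk_fitness = [0.0, 3.0, 1.0]
rng = random.Random(0)
chosen = [roulette_select(disk_fitness, rng) for _ in range(1000)]
```

In this toy setting the second disk should be chosen roughly three times as often as the third, and the first disk never, which is how selection steers backups toward the disks considered more energy-saving.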
Tasks in big data environments, such as MapReduce tasks, are usually subject to data-dependent constraints. The resource selection strategy in distributed storage systems tends to choose the data block nearest to the requester, which ignores the server's resource load state, such as CPU, disk I/O and network load. Based on the cluster structure, data file division mechanism and data block storage mechanism of the distributed storage system, this paper defined the cluster-node matrix, CPU load matrix, disk I/O load matrix, network load matrix, file-division-block matrix, data block storage matrix and data block storage matrix of node status. These matrices model the relationship between a task and its data constraints. An optimal resource selection algorithm with data-dependent constraints (ORS2DC) was then proposed, in which the task scheduling node is responsible for maintaining the base data, and MapReduce tasks and data block read tasks adopt different selection strategies with different resource constraints. The experimental results show that the proposed algorithm chooses higher-quality resources for tasks and improves task completion quality while reducing the load on the NameNode, which lowers the probability of a single point of failure.
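The idea of selecting a resource by load rather than by proximity alone can be sketched as follows. This is not ORS2DC itself: the weighted-sum score, the weight values, the load scale (0 to 1) and all names are assumptions for illustration. The data-dependent constraint is modelled simply by restricting the choice to nodes that hold a replica of the requested block.

```python
def select_node(replica_nodes, load, weights=(0.4, 0.3, 0.3)):
    """Pick the least-loaded node among those holding the requested block.

    `replica_nodes`: names of nodes that store a replica of the block.
    `load`: maps node name -> (cpu, disk_io, network), each in [0, 1].
    `weights`: hypothetical importance of CPU, disk I/O and network load.
    """
    def score(node):
        cpu, disk_io, network = load[node]
        return weights[0] * cpu + weights[1] * disk_io + weights[2] * network

    # Only replica holders are eligible, which enforces the data constraint.
    return min(replica_nodes, key=score)


# Hypothetical cluster state: n3 is idle but holds no replica of the block.
node_load = {
    "n1": (0.9, 0.8, 0.7),
    "n2": (0.2, 0.3, 0.1),
    "n3": (0.1, 0.1, 0.1),
}
best = select_node(["n1", "n2"], node_load)
```

Different task types could use different weight tuples, which is one simple way to express the different resource constraints mentioned for MapReduce tasks and block read tasks.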
To address the low server utilization and serious energy waste in cloud computing environments, an energy-efficient strategy for dynamic management of cloud storage replicas based on user visiting characteristics was put forward. The study of user visiting characteristics was transformed into calculating the visiting temperature of each Block, and each DataNode actively applied to sleep according to the global visiting temperature so as to save energy. The dormancy application and dormancy verification algorithms were given in detail, and the strategy for handling visits during DataNode dormancy was described explicitly. The experimental results show that with this strategy 29%-42% of DataNodes can sleep, energy consumption is reduced by 31%, and server response time remains good. The performance analysis shows that the proposed strategy effectively reduces energy consumption while guaranteeing data availability.
Some existing workflow scheduling algorithms improve the reliability of the entire workflow by sacrificing efficiency or money. Based on an analysis of these reliability problems, this paper proposed a reliability-based workflow strategy. Combining the reliability of tasks in the workflow with the idea of duplication, and taking full account of the priorities among tasks, the strategy lowered the failure rate during transmission and at the same time shortened the transmission time, so it not only enhanced overall reliability but also reduced the makespan. Experiments with different numbers of tasks and different Communication to Computation Ratios (CCR) show that the reliability of cloud workflows under this strategy is better than under the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and its improved version SHEFTEX, and that the proposed algorithm also outperforms HEFT in completion time.
RAMCloud stores data in a log-segment structure. When a large number of small files are stored in RAMCloud, each small file occupies a whole segment, which may lead to heavy fragmentation inside the segments and low memory utilization. To solve this small-file problem, a strategy based on file classification was proposed to optimize the storage of small files. Firstly, small files were classified into three categories: structurally related, logically related and independent files. Before uploading, a merging algorithm and a grouping algorithm were applied to these files according to their category. The experiment demonstrates that, compared with non-optimized RAMCloud, the proposed strategy improves memory utilization.
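The abstract does not describe the merging algorithm, so the following only sketches the general idea of packing several small files into one fixed-size segment and keeping an index for retrieval. The greedy first-fit packing, the index layout and all names are assumptions for illustration, not RAMCloud's API or the authors' method.

```python
def merge_small_files(files, segment_size):
    """Pack small files into segments of at most `segment_size` bytes.

    `files` maps a file name to its bytes. Returns (segments, index),
    where index maps name -> (segment_number, offset, length).
    """
    segments = []
    index = {}
    for name, data in files.items():
        if len(data) > segment_size:
            raise ValueError(f"{name} does not fit in one segment")
        # Open a new segment when the current one cannot hold the file.
        if not segments or len(segments[-1]) + len(data) > segment_size:
            segments.append(bytearray())
        segment = segments[-1]
        index[name] = (len(segments) - 1, len(segment), len(data))
        segment.extend(data)
    return segments, index


def read_file(segments, index, name):
    """Recover one file's bytes from the merged segments."""
    segment_number, offset, length = index[name]
    return bytes(segments[segment_number][offset:offset + length])
```

With three files of 4, 2 and 6 bytes and an 8-byte segment, the first two share one segment and the third starts a new one, so two segments are used instead of three. Related files would be passed in together so that they land in the same segment.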