JP5229731B2 - Cache mechanism based on update frequency

Info

Publication number: JP5229731B2
Application number: JP2008260892A
Authority: JP
Inventors: 洋堀井; 陽介小澤; 清久仁河内谷
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2008-10-07
Filing date: 2008-10-07
Publication date: 2013-07-03
Anticipated expiration: 2028-10-07
Also published as: JP2010092222A

Description

The present invention relates to a cache mechanism based on update frequency. In particular, it relates to a cache mechanism usable in a programming model that extends the Map-Reduce programming model, in which highly parallel programs can be written.

In recent years, the Map-Reduce programming model, a distributed programming model for processing data (files) placed on multiple servers, has come into use (Non-Patent Document 1). The model consists of a Map function, which generates key-value pairs from arbitrary data, and a Reduce function, which combines the values that share a key in that intermediate data. It is generally executed on a cluster of several hundred to several thousand personal computers (PCs). One implementation of the Map-Reduce programming model is the open-source software Hadoop (Non-Patent Document 2).

Map-Reduce processing under this model runs on many PCs, which act as two kinds of server: a “master” and “workers”. The master manages the operation of the whole Map-Reduce process and assigns work to the workers. A worker executes either the Map function or the Reduce function as the master requests. A worker is not tied to one of the two functions, however; it can perform either kind of processing as needed.

Hereinafter, a worker executing the Map function on data is referred to as Map processing, and a worker performing Map processing is referred to as a Map worker (labeled MapWorker in the drawings). Likewise, a worker executing the Reduce function on data is referred to as Reduce processing, and a worker performing Reduce processing is referred to as a Reduce worker (labeled ReduceWorker in the drawings).

Specifically, in Map processing a Map worker uses the Map function to generate a map-type result (a Map result) from its local data. In Reduce processing a Reduce worker uses the Reduce function to generate a single value (a Reduce result) from all the values that share a key across all Map results. An outline of Map-Reduce processing is given below, using a word count over multiple files as an example. FIG. 14 is a diagram outlining Map-Reduce processing according to the related art.

First, each Map worker generates the occurrence count of every word contained in the files it is responsible for (Map processing). Map worker 1 generates, as Map result 10, the occurrence counts of all words in the two files file1 (cat, fox, dog, cat) and file2 (fox, fox, fox, rat). Likewise, Map worker 2 generates Map result 20 from the two files file3 and file5, and Map worker 3 generates Map result 30 from the single file file4. That is, as shown in FIG. 14, each Map worker produces its Map result as map-type data in which a word is the “key” and its occurrence count is the “value”.

Next, the Reduce workers calculate the total occurrence count for each key in the Map results (Reduce processing). In FIG. 14, Reduce worker 1 performs Reduce processing for the keys “cat” and “dog”, and Reduce worker 2 performs it for the keys “fox” and “rat”.

Reduce worker 1 collects the entries for the key “cat” from Map results 10, 20, and 30 and generates the sum of their values as Reduce result 11. It applies the same Reduce processing to the key “dog”, which also goes into Reduce result 11. Reduce worker 2 performs Reduce processing for the keys “fox” and “rat” and generates Reduce result 21. Map results 10', 20', and 30' are the output of the shuffle, an operation carried out between Map processing and Reduce processing that gathers the values of the same key across the Map results. Using the Map-Reduce programming model in this way, processing over a large number of files, such as a word count, can be carried out in parallel. As a result, the processing as a whole can be achieved with low latency.
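The word-count flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the contents of file1 and file2 come from the text, while the second Map result, the function names, and the single-process execution are hypothetical stand-ins for what would run on separate workers.

```python
from collections import Counter, defaultdict

# Contents of file1 and file2 as given in the text (handled by Map worker 1).
FILES_WORKER_1 = {
    "file1": ["cat", "fox", "dog", "cat"],
    "file2": ["fox", "fox", "fox", "rat"],
}

def map_word_count(files):
    """Map: count the occurrences of every word in one worker's files."""
    counts = Counter()
    for words in files.values():
        counts.update(words)
    return dict(counts)

def shuffle(map_results):
    """Shuffle: gather, per key, the values found in all Map results."""
    grouped = defaultdict(list)
    for result in map_results:
        for key, value in result.items():
            grouped[key].append(value)
    return dict(grouped)

def reduce_word_count(values):
    """Reduce: combine all values of one key into a single value."""
    return sum(values)

map_result_10 = map_word_count(FILES_WORKER_1)
# A second Map result with made-up counts (hypothetical, not taken from FIG. 14).
map_result_20 = {"cat": 1, "dog": 2}

grouped = shuffle([map_result_10, map_result_20])
reduce_results = {key: reduce_word_count(vals) for key, vals in grouped.items()}
print(reduce_results)
```

In a real deployment each key would be routed to the Reduce worker responsible for it; here a single dictionary comprehension plays that role.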

Non-Patent Document 1: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters” [online], June 3, 2008, [retrieved August 26, 2008], Internet, <URL: http://labs.google.com/papers/mapreduce-osdi04.pdf>
Non-Patent Document 2: Hadoop, “Welcome to Hadoop” [online], [retrieved October 7, 2008], Internet, <URL: http://hadoop.apache.org/core/>

To make Map-Reduce processing more efficient, one conceivable extension of the Map-Reduce programming model is to add a function by which a Reduce worker saves its Reduce result in its own cache. That is, the Reduce result is stored in the cache of the server device acting as the Reduce worker that generated it. When Map-Reduce processing is then run periodically on the same files, the cached Reduce results can be reused.

However, when a file targeted by Map-Reduce processing is updated and the Map result generated by a Map worker changes, a Reduce result stored in the cache is no longer consistent with the Reduce result that would be obtained from the updated Map result. It cannot be used and must be invalidated. Consequently, there is a problem in that Reduce results cannot be used effectively even if they are cached.

In view of the above problem, an object of the present invention is to extend Map-Reduce programming into a form that can make effective use of a cache, and to provide a method and a system for constructing a cache mechanism, based on update frequency, for processing under the extended Map-Reduce programming model.

One aspect of the present invention provides the following solution.

According to one aspect of the present invention, there is provided a method of constructing a cache mechanism for Reduce processing in a Map-Reduce processing system that executes Map processing and Reduce processing to process a plurality of data in a distributed manner. More precisely, the cache mechanism is constructed in a Map-Reduce-ReduceMerge processing system, which adds ReduceMerge processing: the Reduce processing is executed partially on the results of the Map processing, and the partially processed results are then processed in stages. The method of constructing the cache mechanism includes the steps of: dividing the plurality of data to be processed into a plurality of groups based on the update frequency of each item of data; calculating a group update frequency, which is the update frequency of each group, based on the data update frequencies of the data constituting that group; executing Map processing on each group whose group update frequency is at or below a preset group update frequency threshold, to generate a plurality of Map results; executing Reduce processing partially on the plurality of Map results to generate a plurality of partial Reduce results; executing ReduceMerge processing in stages on the plurality of partial Reduce results to generate new partial Reduce results; and caching the generated partial Reduce results.

According to this aspect, adding ReduceMerge processing to Map-Reduce processing means that Reduce processing is not performed all at once; instead, partial Reduce results are generated and Reduce processing proceeds in stages. In the Map-Reduce-ReduceMerge processing thus obtained, the partial Reduce results produced at the Reduce and ReduceMerge stages for groups whose group update frequency is at or below the threshold are stored in the cache. In other words, the partial Reduce results for rarely updated data are cached so that they can be reused in the next round of Map-Reduce-ReduceMerge processing.

Here, a partial Reduce result of the Reduce or ReduceMerge stage is a Reduce result for a subset of the groups to be processed, and it is generated in stages during the processing. For example, in Map-Reduce-ReduceMerge processing of groups A, B, C, and D, the Reduce results for subsets of those groups, such as groups A and B together, group C alone, or group D alone, are partial Reduce results. The Reduce result for groups C and D, generated from the individual Reduce results (partial Reduce results) of group C and group D, is also a partial Reduce result.

In this aspect, the Map-Reduce-ReduceMerge processing also includes the steps of: combining, based on group update frequency, those of the plurality of groups whose group update frequency is at or below the threshold, and generating a partial Reduce result for each combination; calculating a partial update frequency, which is the update frequency of a partial Reduce result, based on the group update frequencies of the groups corresponding to that partial Reduce result; and combining partial Reduce results according to their partial update frequencies to generate new partial Reduce results, until no further combination of partial Reduce results can be formed.

According to this aspect, partial Reduce results can be generated in stages based on partial update frequency, and storing them in the cache yields a cache mechanism whose entries stay valid and can be reused. Because the partial Reduce results are generated in stages based on partial update frequency, when data is updated the cached partial Reduce results not affected by the update can be used as they are.

Here, the partial update frequency is the update frequency of a partial Reduce result, that is, the update frequency of the Reduce result for a subset of the groups to be processed. Generating partial Reduce results in stages means, for example in Map-Reduce-ReduceMerge processing of groups A, B, and C, generating one after another the partial Reduce results that arise before the final Reduce result for groups A, B, and C is obtained: the partial Reduce result of group A, that of group B, that of group C, and that of groups A and B combined. Then, even if group C is updated, the cached partial Reduce results that do not involve group C, namely those of group A, of group B, and of groups A and B, can be used as they are.
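The groups A, B, C example can be illustrated with a small Python sketch. It is not the patented implementation: the integer counts, the tuple-keyed cache dictionary, and the helper names `reduce_merge` and `invalidate` are assumptions chosen for illustration, and the Reduce operation is taken to be a simple sum for a single key.

```python
def reduce_merge(*partials):
    """Merge partial Reduce results for one key (here: add the counts)."""
    return sum(partials)

# Hypothetical partial Reduce results for one key, one per group.
partial = {"A": 5, "B": 7, "C": 2}

# Cache entries are keyed by the tuple of groups they cover.
cache = {
    ("A",): partial["A"],
    ("B",): partial["B"],
    ("C",): partial["C"],
}
cache[("A", "B")] = reduce_merge(cache[("A",)], cache[("B",)])
final = reduce_merge(cache[("A", "B")], cache[("C",)])

def invalidate(cache, updated_group):
    """Keep only the entries that do not cover the updated group."""
    return {groups: value for groups, value in cache.items()
            if updated_group not in groups}

# Group C is updated: entries for A, B, and (A, B) survive unchanged.
surviving = invalidate(cache, "C")
new_c = 4  # hypothetical recomputed partial Reduce result for group C
new_final = reduce_merge(surviving[("A", "B")], new_c)
print(sorted(surviving), new_final)
```

Only group C has to be recomputed; the merge of A and B is read straight from the cache.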

The present invention can be combined with existing technologies such as the distributed file system GFS (Google (registered trademark) File System) and the large-scale distributed database BigTable, and such combinations are also within the technical scope of the present invention. The technique of the present invention may also be provided as a program product, that is, with the steps of the method of constructing a cache mechanism in the form of a program that can be stored in an FPGA (field-programmable gate array), an ASIC (application-specific integrated circuit), an equivalent hardware logic element, a programmable integrated circuit, or a combination of these. Specifically, means for carrying out the method, devices, embedded apparatus, and the like can be provided in the form of a custom LSI (large-scale integrated circuit) having data input/output, a data bus, a memory bus, a system bus, and so on, and a program product stored in an integrated circuit in this way is also within the technical scope of the present invention.

According to the present invention, it is possible to provide a method and a system for constructing an update-frequency-based cache mechanism applied to Map-Reduce processing. In Map-Reduce-ReduceMerge processing, which extends Map-Reduce programming so that a cache can be used effectively, making effective use of the cache allows the processing to be achieved with lower latency than conventional Map-Reduce processing; that is, CPU cost and communication cost are reduced and the processing is made more efficient.

Embodiments of the present invention are described below with reference to the drawings. These are merely examples, and the technical scope of the present invention is not limited to them.

First, the use of a cache in conventional Map-Reduce processing is described. A method of caching Reduce results in the Map-Reduce processing of FIG. 14 described above is explained with reference to FIG. 15.

FIG. 15 is a diagram showing an example of caching Reduce results in Map-Reduce processing according to the related art. Reduce worker 1 performs Reduce processing for the keys “dog” and “cat”, generates Reduce result 11, and stores Reduce result 11 in cache 1 of Reduce worker 1 (labeled Cache in the drawings). Likewise, Reduce worker 2 performs Reduce processing for the keys “fox” and “rat”, generates Reduce result 21, and stores it in cache 2 of Reduce worker 2.

In this way, from the second run onward, the Reduce results for the keys “dog”, “cat”, “fox”, and “rat” over the same files can be obtained from caches 1 and 2. That is, Map processing and Reduce processing can be omitted from the second run onward, reducing CPU cost and communication cost.

However, when a file targeted by Map-Reduce processing is updated and the Map result generated by a Map worker changes, a Reduce result stored in the cache is no longer consistent with the Reduce result that would be obtained from the updated Map result. It cannot be used and must be invalidated. Consequently, there is a problem in that Reduce results cannot be used effectively even if they are cached. The case where a Map result is updated in the Map-Reduce processing of FIG. 15, in which Reduce results are cached, is described with reference to FIG. 16.

FIG. 16 is a diagram showing a file update while cached Reduce results are in use, according to the related art. Changes from FIG. 15 to FIG. 16 are shown in bold. When “fox” in file1 is updated to “dog” as shown in FIG. 16, Map result 10 of Map worker 1, which is responsible for file1, is updated, and Reduce results 11 and 21, which hold the affected words “dog” and “fox”, are updated as well.

That is, for the updated words “fox” and “dog”, the Reduce results differ from the values stored in the caches of Reduce workers 1 and 2, so Reduce workers 1 and 2 must perform Reduce processing again and cannot use the cached values. Thus, when a Map result changes, that is, when a file is updated, the Reduce results stored in the cache cannot be used and are invalidated.
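The staleness described above can be reproduced with a short Python sketch. It is a simplification under stated assumptions: only file1 and file2 (whose contents appear in the text) are counted, a single function stands in for the whole Map and Reduce pipeline, and the names `word_count` and `stale_keys` are hypothetical.

```python
from collections import Counter

def word_count(files):
    """Stand-in for Map followed by Reduce: total count per word."""
    counts = Counter()
    for words in files:
        counts.update(words)
    return dict(counts)

file1 = ["cat", "fox", "dog", "cat"]
file2 = ["fox", "fox", "fox", "rat"]

# Reduce results cached after the first run.
cache = word_count([file1, file2])

# file1 is updated: its single "fox" becomes "dog".
file1_updated = ["cat", "dog", "dog", "cat"]
fresh = word_count([file1_updated, file2])

# Keys whose cached value no longer matches the recomputed value.
stale_keys = sorted(
    key for key in cache.keys() | fresh.keys()
    if cache.get(key) != fresh.get(key)
)
print(stale_keys)
```

A single-word edit in one file is enough to make the cached values for both affected keys unusable, while the entries for “cat” and “rat” happen to stay correct.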

FIG. 17 is a diagram showing actual Map-Reduce processing according to the related art. The outline of Map-Reduce processing given with reference to FIGS. 14 to 17 assumed that each Map worker handles one or two files, but in actual Map-Reduce processing a single Map worker handles a large number of files, as shown in FIG. 17. File updates therefore occur frequently, and the Reduce workers' caches are frequently invalidated. As a result, Reduce results cannot be used effectively in Map-Reduce processing even if they are cached.

Next, an embodiment of the present invention that solves the problem of the cache not being usable effectively in conventional Map-Reduce processing is described. In this embodiment, a cache mechanism that can make effective use of a cache is constructed by using Map-Reduce-ReduceMerge processing, which extends Map-Reduce processing by adding ReduceMerge processing. Map-Reduce-ReduceMerge processing can provide substantially the same functionality as Map-Reduce processing.

Map-Reduce-ReduceMerge processing consists of Map processing, Reduce processing, and ReduceMerge processing. As explained in the background section, Map processing applies the Map function to a plurality of files to generate Map results, and Reduce processing applies the Reduce function to Map results to generate Reduce results (partial Reduce results).

ReduceMerge processing applies the ReduceMerge function to a plurality of Reduce results for the same key to generate a new Reduce result (partial Reduce result). ReduceMerge processing ends once all Map results are reflected in the Reduce result, that is, once the Reduce result for all files to be processed has been generated. ReduceMerge processing is performed by a worker, and a worker performing it is referred to as a ReduceMerge worker.

The ReduceMerge function merges a plurality of Reduce results for the same key into one new Reduce result. Its inputs are Reduce results generated by Reduce processing or by ReduceMerge processing.

That is, through ReduceMerge processing, the multiple Reduce results for the same key generated by Reduce processing or ReduceMerge processing are merged in stages, and in the end one Reduce result covering all files to be processed is generated for each key. ReduceMerge processing is described concretely with reference to FIG. 1.

FIG. 1 is a diagram showing an example of Map-Reduce-ReduceMerge processing according to an embodiment of the present invention. In this example, the Map workers are assumed not to split the files by update frequency. As in conventional Map-Reduce processing, Map workers and Reduce workers first generate Reduce results 200, 201, and 202 from the input files (ID = 1 to 30000).

ReduceMerge worker 1 (labeled ReduceMergeWorker in the drawings) applies the ReduceMerge function to Reduce results 200 and 201 to generate a new Reduce result 210. ReduceMerge worker 2 then applies the ReduceMerge function to Reduce results 202 and 210 to generate a new Reduce result 220.
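The two merge steps of FIG. 1 can be sketched as follows for a word-count job. The count values and the function name `reduce_merge` are hypothetical; the sketch also assumes that the Reduce operation is an associative sum, which is what lets the staged merge give the same answer as a single pass.

```python
from collections import Counter

def reduce_merge(*reduce_results):
    """Merge several Reduce results (key -> count) into one new Reduce result."""
    merged = Counter()
    for result in reduce_results:
        merged.update(result)  # adds counts key by key
    return dict(merged)

# Hypothetical Reduce results produced from three portions of the input files.
reduce_200 = {"dog": 3, "cat": 1}
reduce_201 = {"dog": 2}
reduce_202 = {"dog": 4, "cat": 5}

# ReduceMerge worker 1: 200 + 201 -> 210 (a partial Reduce result).
reduce_210 = reduce_merge(reduce_200, reduce_201)
# ReduceMerge worker 2: 202 + 210 -> 220 (covers all input files).
reduce_220 = reduce_merge(reduce_202, reduce_210)
print(reduce_210, reduce_220)
```

Result 210 is the kind of intermediate value that can be cached: it stays valid as long as the files behind results 200 and 201 do not change.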

As described above, by merging Reduce results in stages, the ReduceMerge workers generate partial Reduce results for the files to be processed. Storing these partial Reduce results in a cache makes it possible to construct a cache mechanism that uses the cache effectively. In other words, the Reduce processing that conventional Map-Reduce processing performs in a single pass is, in the Map-Reduce-ReduceMerge processing of this embodiment, performed in stages by way of partial Reduce results, which allows an effectively usable cache mechanism to be constructed where conventional Reduce processing could not support one.

The method of constructing a cache mechanism in Map-Reduce-ReduceMerge processing according to an embodiment of the present invention comprises three broad methods: a method of generating the cache during the first round of Map-Reduce-ReduceMerge processing (cache generation method); a method of invalidating the cache when data is updated in Map-Reduce-ReduceMerge processing for which a cache has already been generated (cache invalidation method); and a method of performing Map-Reduce-ReduceMerge processing using the cache (cache reuse method). Each is described below.

In an embodiment of the present invention, the cache is storage means for holding data and programs within a server, such as a semiconductor medium (RAM, ROM, or the like), a hard disk, or a magnetic or electrical medium allocated by the OS through a device driver.

FIG. 2 is a flowchart showing the cache generation method according to an embodiment of the present invention. Using the cache produced by the method of FIG. 2 makes it possible to skip part of the Map-Reduce-ReduceMerge processing from the second run onward.

S10: The Map worker divides the files (data) it is responsible for into groups according to each file's update frequency (data update frequency). Step S10 is described concretely below, using the grouping of files ID = 1 to ID = 8 by file update frequency as an example.

FIG. 3 is a diagram showing an example of file grouping according to an embodiment of the present invention. In the example of FIG. 3, the files ID = 1 to ID = 8 handled by a certain Map worker are classified into two groups (High, Low): group A, with high file update frequency, and group B, with low file update frequency. The file update frequency of each file is expressed as the number of updates per unit time; for example, a file updated once every 10 seconds has a frequency of 1/10.

FIG. 3 shows the file content, ID, and file update frequency together for each file. Taking 1/150 as the file update frequency threshold between group A and group B, the files whose file update frequency is 1/150 or higher, ID = 1, 2, 5, and 8, are assigned to group A, while the files whose file update frequency is 1/150 or lower, ID = 3, 4, 6, and 7, are assigned to group B. The threshold is arbitrary; tests may be run with varying file update frequency values to find an optimum at which the cache can be used effectively.
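Step S10 can be sketched as a simple threshold split. In the sketch below the frequencies for ID = 1, 2, 5, and 8 are the ones the text uses when it computes the group update frequency of group A; the frequencies for ID = 3, 4, 6, and 7 are hypothetical values chosen to fall below 1/150, and the function name is an assumption. Since the text gives both "1/150 or higher" and "1/150 or lower", the sketch arbitrarily places a file exactly at the threshold in group A.

```python
from fractions import Fraction

THRESHOLD = Fraction(1, 150)  # file update frequency threshold from the text

file_update_freq = {
    1: Fraction(1, 10),
    2: Fraction(3, 20),
    3: Fraction(1, 200),    # hypothetical value
    4: Fraction(1, 300),    # hypothetical value
    5: Fraction(1, 5),
    6: Fraction(1, 600),    # hypothetical value
    7: Fraction(1, 1000),   # hypothetical value
    8: Fraction(1, 40),
}

def split_by_update_frequency(freqs, threshold):
    """Return (high, low) lists of file IDs; ties go to the high group (assumption)."""
    high, low = [], []
    for file_id, freq in sorted(freqs.items()):
        (high if freq >= threshold else low).append(file_id)
    return high, low

group_a, group_b = split_by_update_frequency(file_update_freq, THRESHOLD)
print(group_a, group_b)
```

Exact fractions are used instead of floats so that comparisons against the threshold are not affected by rounding.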

Returning to FIG. 2:

S20: The Map worker calculates the group update frequency of each group formed in step S10.

For example, the group update frequency of group A in FIG. 3 is calculated by adding the file update frequencies of ID = 1, 2, 5, and 8, the files that make up group A. That is, 1/10 (ID = 1) + 3/20 (ID = 2) + 1/5 (ID = 5) + 1/40 (ID = 8) = 19/40 is the group update frequency of group A.
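The arithmetic of step S20 can be checked with exact fractions. The sketch below assumes only the four frequencies quoted above; the function name `group_update_frequency` is hypothetical.

```python
from fractions import Fraction

# File update frequencies of the files that make up group A (from the description).
group_a_freqs = {
    1: Fraction(1, 10),
    2: Fraction(3, 20),
    5: Fraction(1, 5),
    8: Fraction(1, 40),
}

def group_update_frequency(freqs):
    """Group update frequency = sum of the member files' update frequencies."""
    return sum(freqs.values(), Fraction(0))

print(group_update_frequency(group_a_freqs))  # 19/40
```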

S30: The Map worker applies the Map function to each group and generates a Map result for each, and sets the group update frequency of the corresponding group as the update frequency of that Map result. That is, the group update frequency calculated for each group in step S20 becomes the update frequency of the corresponding Map result. Step S30 is described concretely with reference to FIG. 4.

FIG. 4 is a diagram showing an example of processing by Map workers according to an embodiment of the present invention. In the example of FIG. 4, the files ID = 1 through ID = 30000 are divided into six groups, A to F, and the Map function is applied to each group. In FIG. 4 and the figures that follow, only the processing for obtaining the Reduce result for the key "dog" is described, to keep the description short.

Among the files ID = 1 through ID = 10000, group A is the group of files whose update frequency is higher than a predetermined threshold, and group B is the group of files whose update frequency is lower than the threshold. Similarly, groups C and D divide the files ID = 10001 through ID = 20000, and groups E and F divide the files ID = 20001 through ID = 30000, according to whether the update frequency is higher or lower than the predetermined threshold.

Map worker 1 applies the Map function to group A and generates Map result 100. When the group update frequency of group A is 1/8, Map result 100 inherits that frequency, so its update frequency is 1/8. Similarly, Map worker 1 applies the Map function to group B and generates Map result 101. Map result 101 inherits the group update frequency of group B, which is 1/1000. For groups C to F as well, a Map result is generated for each group and its update frequency is determined at the same time.

In this way, the update frequency of a Map result generated by a Map worker is determined from the group update frequency of the corresponding group.

Returning to FIG. 2,
S40: The Reduce worker applies the Reduce function to one or more Map results and generates a Reduce result. Taken together, the multiple Reduce workers generate Reduce results from all Map results, that is, from all groups.

S50: The Reduce worker stores the Reduce result generated in step S40 in a cache. Each Reduce worker's cache holds the Reduce result that the worker itself generated. The Reduce worker also stores in the cache, as original data information, the names of the groups that are the original data of the Reduce result. Furthermore, the Reduce worker can obtain the update frequency of the generated Reduce result as the sum of the group update frequencies of the groups that are its original data, and stores that update frequency in the cache. The processing of step S40 is described concretely with reference to FIG. 5.

FIG. 5 is a diagram showing an example of processing by Reduce workers according to an embodiment of the present invention. In the example of FIG. 5, the Reduce results obtained from the six groups A to F are stored in the caches of the respective Reduce workers.

Reduce worker 1 stores Reduce result 110, obtained from Map result 100, in cache 1. Reduce worker 1 also stores group A, the original data of Reduce result 110, in cache 1 as original data information, because Map result 100, the input data of Reduce result 110, is a value obtained from group A. In addition, Reduce worker 1 stores the group update frequency of group A, 1/8, in cache 1 as the update frequency of Reduce result 110. This is because the update frequency of Reduce result 110 is that of its input data, Map result 100, and the update frequency of Map result 100 is the group update frequency of group A.

Reduce worker 2 stores Reduce result 111, obtained from Map results 101, 103, and 105, in cache 2. Reduce worker 2 also stores in cache 2, as original data information, the fact that the original data of Reduce result 111 is groups B, D, and F, because Map results 101, 103, and 105, the input data of Reduce result 111, are values obtained from groups B, D, and F respectively.

Reduce worker 2 also stores in cache 2, as the update frequency of Reduce result 111, the sum of the update frequencies of groups B, D, and F, its original data: 1/1000 + 1/2000 + 1/4000 = 7/4000. This is because the update frequency of Reduce result 111 is the sum of the update frequencies of its input data, Map results 101, 103, and 105, and those update frequencies are the group update frequencies of groups B, D, and F respectively.
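What a Reduce worker stores in its cache (the result, the original data information, and the update frequency) can be modelled as a small record. This is a minimal sketch: the class `CacheEntry` and the function `make_entry` are hypothetical names, and the Reduce result itself is treated as an opaque value.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class CacheEntry:
    result: object            # cached Reduce result (opaque in this sketch)
    source_groups: frozenset  # original data information: names of source groups
    update_freq: Fraction     # sum of the source groups' update frequencies

# Group update frequencies of groups B, D, F as given in the description.
group_freq = {"B": Fraction(1, 1000), "D": Fraction(1, 2000), "F": Fraction(1, 4000)}

def make_entry(result, groups, group_freq):
    """Build a cache entry whose update frequency is the sum over its source groups."""
    groups = frozenset(groups)
    total = sum((group_freq[g] for g in groups), Fraction(0))
    return CacheEntry(result, groups, total)

cache2 = make_entry("Reduce result 111", ["B", "D", "F"], group_freq)
print(sorted(cache2.source_groups), cache2.update_freq)  # ['B', 'D', 'F'] 7/4000
```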

When a Reduce worker applies the Reduce function to multiple Map results, as Reduce worker 2 does above, the Map results are selected so that the resulting update frequency does not become high. This is to make effective use of the Reduce worker's cache. A method of selecting Map results so that the update frequency does not become high is described later with reference to FIGS. 11 and 12.

Returning to FIG. 2,
S60: The ReduceMerge worker applies the ReduceMerge function to the Reduce results generated by multiple Reduce workers and generates a new Reduce result.

S70: The ReduceMerge worker stores the Reduce result generated in step S60 in a cache. The ReduceMerge worker also stores in the cache, as original data information, the names of the groups that are the original data of the Reduce result. Furthermore, the ReduceMerge worker can obtain the update frequency of the generated Reduce result as the sum of the group update frequencies of the groups that are its original data, and stores that update frequency in the cache.

S80: The master determines whether the Reduce results of all groups have been merged. If YES, the processing ends. If NO, the processing returns to step S60. The determination in step S80 may instead be made by the ReduceMerge worker. The processing of steps S60 to S80 is described concretely with reference to FIG. 6.

FIG. 6 is a diagram showing an example of processing by ReduceMerge workers according to an embodiment of the present invention. It continues the Reduce worker example shown in FIG. 5. A ReduceMerge worker applies the ReduceMerge function to multiple Reduce results, and ultimately a Reduce result covering all groups is generated. ReduceMerge worker 1 applies the ReduceMerge function to Reduce result 110, generated by Reduce worker 1, and Reduce result 111, generated by Reduce worker 2, and generates Reduce result 120. ReduceMerge worker 1 stores Reduce result 120 in cache 11.

ReduceMerge worker 1 also stores in cache 11 the fact that the original data of Reduce result 120 is groups A, B, D, and F. Reduce result 120 is generated from Reduce result 110, whose original data is group A, and Reduce result 111, whose original data is groups B, D, and F, so groups A, B, D, and F are the original data of Reduce result 120.

Furthermore, the update frequency of Reduce result 120 is stored in cache 11. It is the sum of the update frequency of Reduce result 110, 1/8, and the update frequency of Reduce result 111, 7/4000: 1/8 + 7/4000 = 507/4000. Equivalently, the update frequency of Reduce result 120 is the sum of the update frequencies of groups A, B, D, and F, its original data.
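The bookkeeping that accompanies a ReduceMerge step can be sketched as below. Each Reduce result is represented only by its source groups and update frequency; the function name `merge_entries` is hypothetical, and the overlap check is an assumption of this sketch (summing frequencies equals the sum over source groups only when the two inputs share no group).

```python
from fractions import Fraction

def merge_entries(a, b):
    """Combine the metadata of two Reduce results: (source groups, update frequency)."""
    (groups_a, freq_a), (groups_b, freq_b) = a, b
    if groups_a & groups_b:
        # Assumption: inputs to a merge must come from disjoint sets of groups.
        raise ValueError("source groups overlap")
    return groups_a | groups_b, freq_a + freq_b

r110 = (frozenset("A"), Fraction(1, 8))         # Reduce result 110: group A
r111 = (frozenset("BDF"), Fraction(7, 4000))    # Reduce result 111: groups B, D, F

groups, freq = merge_entries(r110, r111)        # metadata of Reduce result 120
print(sorted(groups), freq)  # ['A', 'B', 'D', 'F'] 507/4000
```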

Further, ReduceMerge worker 2 applies the ReduceMerge function to Reduce result 112, generated by Reduce worker 3, and Reduce result 120, generated by ReduceMerge worker 1, and generates Reduce result 130. ReduceMerge worker 2 stores Reduce result 130 in cache 21.

In this way, a ReduceMerge worker can also apply the ReduceMerge function to Reduce results generated by other ReduceMerge workers, and the ReduceMerge processing continues until a Reduce result covering all groups has been generated. That is, multiple partial Reduce results, each covering several groups, are generated and are finally combined into a single Reduce result. The ReduceMerge workers thus generate Reduce results in stages and store each of them in a cache.

Storing each of these staged Reduce results in a cache builds a staged cache mechanism. As a result, the cache can be used effectively in Map-Reduce-ReduceMerge processing. In other words, adding ReduceMerge processing to Map-Reduce processing extends it into a form of Map-Reduce processing in which caching is more effective.

The means of the system that builds the cache mechanism are described using the cache generation flow shown in FIG. 2. The classification means, on receiving the files to be processed, performs step S10 of FIG. 2. The frequency calculation means receives the update frequency of each file and performs step S20 of FIG. 2. The result generation means receives the groups produced by the classification means and performs steps S30 to S40 of FIG. 2 for each group. The cache means receives the Reduce results produced by the result generation means and performs step S50 of FIG. 2.

The result generation means can also receive Reduce results that it has itself produced and generate a new Reduce result by the processing of step S60 of FIG. 2. The cache means then receives the Reduce result newly produced by the result generation means and performs step S70 of FIG. 2.

Next, a cache invalidation method is described that partially invalidates the cache mechanism built by the cache generation processing above when a file is updated. FIG. 7 is a flowchart showing cache invalidation processing according to an embodiment of the present invention. The cache invalidation processing may be performed by a Map worker or by the master. The case where the master performs it is described below.

S100: The master detects that a file has been updated. Specifically, the master keeps a program known as a robot or crawler resident and uses it to detect file updates.

S110: The master identifies the group that contains the updated file detected in step S100.

S120: The master invalidates every cache of a Reduce result that was generated with the group identified in step S110 as one of its original data. Specifically, the master searches for caches whose original data information includes the group identified in step S110 and invalidates each such cache it finds.

FIG. 8 is a diagram showing an example of cache invalidation according to an embodiment of the present invention. The cache invalidation method above is described concretely with reference to FIG. 8, which combines FIG. 5 and FIG. 6.

Suppose that a file in group C has been updated. The updated parts are shown in bold. First, the master detects that a file has been updated and identifies the group containing it, here group C. Because the updated file belongs to group C, the master invalidates cache 3 and cache 21, whose original data information includes group C.
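Steps S110 to S120 amount to a lookup over the original data information of each cache. The sketch below assumes the cache contents as read from the description (cache 1: A; cache 2: B, D, F; cache 3: C, E; cache 11: A, B, D, F; cache 21: all six groups); the function name `invalidate` and the dictionary representation are illustrative only.

```python
# Original data information per cache ID, as read from the description.
caches = {
    1: {"A"},
    2: {"B", "D", "F"},
    3: {"C", "E"},
    11: {"A", "B", "D", "F"},
    21: {"A", "B", "C", "D", "E", "F"},
}

def invalidate(caches, updated_groups):
    """Drop every cache whose source groups include an updated group.

    Returns the sorted IDs of the caches that were invalidated.
    """
    updated = set(updated_groups)
    stale = sorted(cid for cid, groups in caches.items() if groups & updated)
    for cid in stale:
        del caches[cid]
    return stale

print(invalidate(caches, ["C"]))  # [3, 21]
print(sorted(caches))             # [1, 2, 11]
```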

Also, when the update frequency of an individual file rises above the update frequency threshold of the group (High or Low) to which the file belongs, the groups are reconfigured and caches are invalidated by the following procedure.

FIG. 9 is a diagram showing an example of cache invalidation due to group reconfiguration according to an embodiment of the present invention. The case described is one in which, of groups A and B in FIG. 9, data in group B has come to be updated frequently.

First, the master identifies the file that is being updated frequently. Whether a file is being updated frequently is judged by whether its file update frequency has exceeded the threshold that was used when the files were grouped. The master then obtains the source group and the destination group of the identified file. In this example, the identified file is in group B, so the source group is group B. The high-update-frequency group that corresponds to group B is group A, so the destination group is group A.

The master invalidates caches based on the information obtained. In this example, the master invalidates caches 1, 2, 11, and 21, whose original data information includes group A or group B. Cache 3, which derives only from groups other than A and B, is retained as it is.

Next, the master moves the identified data from the source group, group B, to the destination group, group A.

A file whose update frequency has become low can be moved from A to B by the same procedure. Because a file is only moved between groups within the same server, the way the data is split across servers does not change. In this example it may appear that many caches are invalidated, but in practice there are far more caches and the invalidation is local.
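The reconfiguration procedure can be sketched as: invalidate every cache that depends on the source or destination group, then move the file. The cache contents are those read from the description; the file membership of groups A and B, the file ID 4, and the function name `regroup` are assumptions made for illustration.

```python
# Original data information per cache ID, as read from the description.
caches = {
    1: {"A"},
    2: {"B", "D", "F"},
    3: {"C", "E"},
    11: {"A", "B", "D", "F"},
    21: {"A", "B", "C", "D", "E", "F"},
}

# Hypothetical file membership of the two groups on one server.
files_by_group = {"A": {1, 2, 5, 8}, "B": {3, 4, 6, 7}}

def regroup(files_by_group, caches, file_id, src, dst):
    """Invalidate caches derived from src or dst, then move file_id from src to dst."""
    stale = sorted(cid for cid, groups in caches.items() if groups & {src, dst})
    for cid in stale:
        del caches[cid]
    files_by_group[src].remove(file_id)
    files_by_group[dst].add(file_id)
    return stale

print(regroup(files_by_group, caches, 4, "B", "A"))  # [1, 2, 11, 21]
print(sorted(caches))                                # [3]
print(sorted(files_by_group["A"]))                   # [1, 2, 4, 5, 8]
```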

Next, a cache reuse method is described that uses the cache mechanism built by the cache generation method and the cache invalidation method above.

FIG. 10 is a flowchart showing Map-Reduce-ReduceMerge processing that uses the cache mechanism according to an embodiment of the present invention.

S200: The master searches all valid caches and selects caches so that the Reduce results of all groups are covered as far as the valid caches allow. The selected caches must not overlap, that is, no group may be covered by more than one selected cache.

S210: The master requests Map processing from the Map workers responsible for the groups that are not covered by the one or more selected caches.

S220: The Reduce workers generate Reduce results from the Map results produced by the Map workers that the master asked to perform Map processing in step S210.

S230: ReduceMerge processing is performed on the Reduce results of the caches selected by the master in step S200 and the Reduce results generated by the Reduce workers in step S220, and the final Reduce result is generated.

FIG. 11 is a diagram showing an example of Map-Reduce-ReduceMerge processing that uses the cache mechanism according to an embodiment of the present invention. When the caches are as shown in FIG. 11, for example, cache 11 of ReduceMerge worker 1 subsumes cache 1 of Reduce worker 1 and cache 2 of Reduce worker 2. In this case, either cache 11 of ReduceMerge worker 1, or caches 1 and 2 of Reduce worker 1 and Reduce worker 2, can be selected as the cache of the Reduce results of groups A, B, D, and F. In FIG. 11, cache 11 of ReduceMerge worker 1 is selected.

Next, the Map workers apply the Map function to the groups not covered by the selected cache and generate Map results. Cache 11 of ReduceMerge worker 1 holds a Reduce result whose original data does not include groups C and E, so Map workers 2 and 3 apply the Map function and generate Map results 12 and 14 for groups C and E respectively.

Next, a Reduce worker generates a Reduce result from the Map results produced by the Map workers. In FIG. 11, Reduce worker 3 generates Reduce result 112 from Map results 12 and 14.

Finally, ReduceMerge processing is performed on the Reduce result of the selected cache and the Reduce result generated by the Reduce worker, and the final Reduce result is generated. In FIG. 11, ReduceMerge worker 2 generates Reduce result 130 from Reduce result 120, stored in cache 11 of ReduceMerge worker 1, and Reduce result 112, generated by Reduce worker 3.

In FIG. 11, the parts drawn with dotted lines indicate processing that was skipped because a cache was used. As FIG. 11 shows, much of the processing can be skipped by using caches. As a result, the load on the system as a whole is reduced, and the overall processing can be completed with lower latency than conventional Map-Reduce processing.
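The cache selection of step S200 and the identification of groups that still need Map processing (step S210) can be sketched as follows. The description does not state how the master chooses among overlapping caches; the largest-first greedy rule used here is an assumption, as is the function name `select_caches`. The valid caches are those of the example in which caches 3 and 21 have been invalidated.

```python
def select_caches(valid_caches, all_groups):
    """Pick non-overlapping caches greedily, largest first.

    Returns (chosen cache IDs, sorted groups that must be recomputed).
    """
    chosen, covered = [], set()
    # Largest caches first; ties broken by cache ID so the result is deterministic.
    ordered = sorted(valid_caches.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for cid, groups in ordered:
        if not (groups & covered):   # selected caches must not share a group
            chosen.append(cid)
            covered |= groups
    return chosen, sorted(set(all_groups) - covered)

# Valid caches and their source groups, as read from the description.
valid = {1: {"A"}, 2: {"B", "D", "F"}, 11: {"A", "B", "D", "F"}}

chosen, to_recompute = select_caches(valid, "ABCDEF")
print(chosen, to_recompute)  # [11] ['C', 'E']
```

With these inputs the sketch selects cache 11 and leaves groups C and E for the Map workers, which matches the choice shown for FIG. 11.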

(Modification 1)
In the above embodiment, any server may act as a Map worker or a Reduce worker, and a Map result produced by a Map worker on one server may be processed either by a Reduce worker on the same server or by a Reduce worker on another server. However, it is more efficient for a Map result to be processed by a Reduce worker on the same server as the Map worker that produced it, because the Map result then does not have to be transferred. Optimizing the physical placement of Map workers and Reduce workers therefore makes more efficient Map-Reduce-ReduceMerge processing possible.

(Modification 2)
In the above embodiment, all Reduce results are stored in caches regardless of their update frequency. However, the cache of a Reduce result with a high update frequency can be reused fewer times than the cache of a Reduce result with a low update frequency, so Reduce results with a high update frequency may be excluded from caching. This eliminates the processing needed to cache such Reduce results and to invalidate their caches, making more efficient Map-Reduce-ReduceMerge processing possible.

(Modification 3)
In the above embodiment, Reduce results are stored in caches, but Map results may be cached in the same way. A Map result is stored in the cache of the Map worker that generated it. Here too, Map results with a high update frequency may be excluded. Even when the cache of a Reduce result cannot be used, the cache of a Map result may then be usable, making more efficient Map-Reduce-ReduceMerge processing possible.

(Modification 4)
In the above embodiment, the groups that are not cached are identified by checking the groups of every cache. By placing an index that records cache locations on a single server, the uncached groups can be identified with low latency. This makes more efficient Map-Reduce-ReduceMerge processing possible.

(Modification 5)
In the above embodiment, files are divided by update frequency into two groups, a group with a high file update frequency (High) and a group with a low file update frequency (Low), but they may be divided into any number of groups. Grouping files more finely by update frequency increases the number of caches that remain valid, making more efficient Map-Reduce-ReduceMerge processing possible.

(Modification 6)
In the above embodiment, the combinations of inputs to Reduce (ReduceMerge) are arbitrary, but processing that optimizes these combinations may be added in order to increase the number of caches that remain valid.

FIG. 12 is a flowchart of processing that optimizes the combinations to be reduced (Reduce or ReduceMerge) according to an embodiment of the present invention.
S300: The master obtains the threshold α for the update frequency of a Reduce result generated by Reduce (ReduceMerge) processing. The threshold may be chosen freely and may be set by a person. However, it should not be too close to 1: with a threshold near 1, a Reduce result covers many groups, which raises the probability that its cache is invalidated and prevents effective use of the cache.

S310: The master selects one Reduce worker (ReduceMerge worker), excluding workers whose processing targets have already been decided. The master also empties the group set G. The group set G is used as temporary storage while a combination of inputs to Reduce (ReduceMerge) is being built.

S320: From the groups not yet selected, the master selects the one with the lowest group update frequency (hereinafter, group g).

S330: The master determines whether, if group g selected in step S320 were added to group set G and Reduce processing were performed with G as input, the update frequency of the resulting Reduce result would exceed α. If YES, the processing moves to step S350; if NO, to step S340.

S340: When the determination in step S330 is NO, the master adds group g to group set G and returns to step S320. In this way, groups are added to G for as long as the update frequency of the Reduce result obtained from G does not exceed the threshold α.

S350: When the determination in step S330 is YES, the master determines whether group set G is empty. If YES, the processing moves to step S360; if NO, to step S370.

S360: The master adds group g to group set G. This is done so that a single group whose Reduce result alone has an update frequency above the threshold α is not combined with other groups but is reduced on its own.
S370: The master assigns all groups in group set G as the Reduce (ReduceMerge) targets of the Reduce (ReduceMerge) worker selected in step S310.

S380: The master determines whether all groups have been selected, that is, whether every group has become a processing target of some Reduce (ReduceMerge) worker. If YES, the processing ends; if NO, it returns to step S310.

In this way, groups with a low group update frequency are combined, and a single Reduce result is generated from them and stored in the cache. This optimizes the combination of groups to be Reduced and makes more efficient Map-Reduce-ReduceMerge processing possible.
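The flow of steps S310 to S380 can be sketched as a greedy packing routine. This is only an illustrative sketch, not the patented implementation: the function name `plan_reduce_targets`, the dictionary input, and the choice to visit groups in ascending order of update frequency are assumptions, and the update frequency of a combined Reduce result is modeled here as the sum of the group update frequencies (in line with the sum condition stated in the claims).

```python
def plan_reduce_targets(group_freq, alpha):
    """Assign groups to Reduce (ReduceMerge) workers, one list of groups per worker.

    group_freq: dict mapping a group name to its update frequency (hypothetical input).
    alpha: threshold on the update frequency of a combined Reduce result.
    Assumption: the combined update frequency is the sum of the group frequencies.
    """
    # Assumption: groups are visited in ascending order of update frequency.
    pending = sorted(group_freq, key=lambda g: (group_freq[g], g))
    plan = []
    while pending:                              # S380: repeat until every group is assigned
        current, total = [], 0                  # S310: pick a worker, start with an empty set G
        while pending:
            g = pending[0]                      # S320: select an unassigned group g
            if total + group_freq[g] <= alpha:  # S330 is NO: threshold not exceeded
                current.append(pending.pop(0))  # S340: add g to G, go back to S320
                total += group_freq[g]
            elif not current:                   # S330 is YES and S350 is YES: G is empty
                current.append(pending.pop(0))  # S360: g is Reduced on its own
                break
            else:                               # S330 is YES and S350 is NO
                break                           # g stays unassigned for the next worker
        plan.append(current)                    # S370: the worker receives all groups in G
    return plan


if __name__ == "__main__":
    freqs = {"g1": 1, "g2": 2, "g3": 4, "g4": 9}  # made-up updates per day
    print(plan_reduce_targets(freqs, alpha=5))
```

With these made-up frequencies and α = 5, the two rarely updated groups share one worker, while the group whose frequency alone exceeds α is Reduced by itself.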

FIG. 13 is a flowchart of the ReduceMerge process according to an embodiment of the present invention.
S400: The master acquires the thresholds γ and Δ, and sets the update frequency α of the Reduce results generated by the ReduceMerge process to γ (α = γ). The values of γ and Δ are arbitrary and may be chosen by a person.
S410: Using the update frequency α as input, the master performs the process, shown in FIG. 12, of optimizing the combination of groups to be Reduced.
S420: Each Reduce worker performs the Reduce process on the groups assigned to it as processing targets in step S410.

S430: The master determines whether exactly one Reduce result was generated in step S420. If the determination is NO, the process proceeds to step S440; if YES, the process ends. The ReduceMerge process terminates when a Reduce result covering all groups has been generated. If step S420 generated only one Reduce result, that result covers all groups and therefore satisfies the termination condition.

S440: The master sets the update frequency α = α + Δ. Increasing the update frequency little by little reduces the number of generated Reduce results in stages; that is, the ReduceMerge process is carried out step by step.
S450: Using the update frequency α as input, the master performs the process, shown in FIG. 12, of optimizing the combination of groups to be ReduceMerged.
S460: Each ReduceMerge worker that was assigned two or more groups as ReduceMerge targets in step S450 performs the ReduceMerge process, and the process returns to step S430.

In this way, by performing the ReduceMerge process in several stages according to the update frequency of the Reduce results, a cache mechanism can be constructed that makes effective use of cached Reduce results even when updates occur frequently. As a result, more efficient Map-Reduce-ReduceMerge processing can be realized.
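The staged loop of steps S400 to S460 can be illustrated with a small word-count style sketch. Everything named here is an assumption made for illustration: the helper `pack`, the function `staged_reduce_merge`, the use of `collections.Counter` addition as the Reduce/ReduceMerge operation, the plain dictionary standing in for the cache, and the modeling of a combined result's update frequency as the sum of its parts. It is not the patented implementation.

```python
from collections import Counter


def pack(freqs, alpha):
    """Greedy packing of item indices so that each set's summed frequency stays <= alpha.

    An item whose frequency alone exceeds alpha ends up in a set by itself.
    Assumption: items are visited in ascending order of update frequency.
    """
    order = sorted(range(len(freqs)), key=lambda i: freqs[i])
    sets, current, total = [], [], 0
    for i in order:
        if current and total + freqs[i] > alpha:
            sets.append(current)        # close the current set, start a new one
            current, total = [], 0
        current.append(i)
        total += freqs[i]
    if current:
        sets.append(current)
    return sets


def staged_reduce_merge(map_results, gamma, delta):
    """map_results: list of (group_name, update_frequency, Counter) tuples (hypothetical)."""
    if delta <= 0:
        raise ValueError("delta must be positive so that the loop terminates")
    alpha = gamma                                            # S400: alpha = gamma
    partials = [(frozenset([name]), freq, counts) for name, freq, counts in map_results]
    cache, stages = {}, 0                                    # dict stands in for the cache
    while True:
        merged = []
        for members in pack([p[1] for p in partials], alpha):   # S410 / S450
            parts = [partials[i] for i in members]
            if len(parts) == 1 and stages > 0:
                merged.append(parts[0])                      # S460: nothing to merge, reuse
                continue
            groups = frozenset().union(*(p[0] for p in parts))
            freq = sum(p[1] for p in parts)                  # assumed: frequencies add up
            value = sum((p[2] for p in parts), Counter())    # S420 / S460: Reduce or ReduceMerge
            cache[groups] = (freq, value)                    # keep every partial result
            merged.append((groups, freq, value))
        partials = merged
        stages += 1
        if len(partials) == 1:                               # S430: one result covers all groups
            return partials[0][2], cache, stages
        alpha += delta                                       # S440: alpha = alpha + delta


if __name__ == "__main__":
    data = [("g1", 1, Counter(a=1)), ("g2", 2, Counter(a=2, b=1)),
            ("g3", 4, Counter(b=3)), ("g4", 9, Counter(c=1))]
    result, cache, stages = staged_reduce_merge(data, gamma=3, delta=4)
    print(result, len(cache), stages)
```

Because intermediate results for rarely updated groups are stored under their own keys, an update to a frequently updated group would leave those entries usable; only the entries whose key contains the updated group would need to be recomputed.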

Although the present invention has been described above with reference to an embodiment, the present invention is not limited to that embodiment. The effects described in the embodiment merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiment or examples.

FIG. 1 is a diagram showing an example of Map-Reduce-ReduceMerge processing according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a cache generation method according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of grouping files according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of processing by a Map worker according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of processing by a Reduce worker according to an embodiment of the present invention.
FIG. 6 is a diagram showing an example of processing by a ReduceMerge worker according to an embodiment of the present invention.
FIG. 7 is a flowchart showing cache invalidation processing according to an embodiment of the present invention.
FIG. 8 is a diagram showing an example of cache invalidation according to an embodiment of the present invention.
FIG. 9 is a diagram showing an example of cache invalidation caused by group reconfiguration according to an embodiment of the present invention.
FIG. 10 is a flowchart of Map-Reduce-ReduceMerge processing using the cache mechanism according to an embodiment of the present invention.
FIG. 11 is a diagram showing an example of Map-Reduce-ReduceMerge processing using the cache mechanism according to an embodiment of the present invention.
FIG. 12 is a flowchart of the process of optimizing the groups to be Reduced (ReduceMerged) according to an embodiment of the present invention.
FIG. 13 is a flowchart of the ReduceMerge process according to an embodiment of the present invention.
FIG. 14 is a diagram showing an overview of Map-Reduce processing according to the prior art.
FIG. 15 is a diagram showing an example of caching Reduce results in Map-Reduce processing according to the prior art.
FIG. 16 is a diagram showing a file update while Reduce results stored in the cache are in use, according to the prior art.
FIG. 17 is a diagram showing actual Map-Reduce processing according to the prior art.

Claims

A method for constructing a cache mechanism for Reduce processing in a Map-Reduce processing system that executes Map processing and Reduce processing to process a plurality of data in a distributed manner,
the Map-Reduce processing system being a Map-Reduce-ReduceMerge processing system in which the Reduce processing is executed partially on the results of the Map processing and ReduceMerge processing is added to process the partially processed results in stages, the method comprising:
classifying the plurality of data into a plurality of groups based on a data update frequency, which is the update frequency of each of the plurality of data;
calculating a group update frequency, which is the update frequency of each of the plurality of groups, based on the data update frequency of the data constituting each of the plurality of groups;
generating a plurality of Map results by executing the Map processing on each of the plurality of groups whose group update frequency is equal to or less than a preset threshold of the group update frequency;
executing the Reduce processing partially on the plurality of Map results to generate a plurality of partial Reduce results;
executing the ReduceMerge processing in stages on the plurality of partial Reduce results to generate a new partial Reduce result; and
storing the partial Reduce result in a cache.

The method according to claim 1, further comprising storing, in the cache, information identifying the group corresponding to the partial Reduce result when the partial Reduce result is stored in the cache.

The method according to claim 1, further comprising calculating a partial update frequency, which is the update frequency of the partial Reduce result, based on the group update frequency of the group corresponding to the partial Reduce result,
wherein, when the partial Reduce result is stored in the cache, the partial update frequency is stored in the cache together with the partial Reduce result.

The method according to claim 3, wherein the step of generating the new partial Reduce result further comprises:
combining the partial Reduce results based on the partial update frequency and generating a new partial Reduce result for each combination; and
repeating the step of generating the new partial Reduce result until no further combination of the partial Reduce results can be formed.

The method according to claim 4, wherein, in the step of generating the new partial Reduce result, the sum of the partial update frequencies of the partial Reduce results to be combined is equal to or less than a preset threshold of the partial update frequency, and
the repeating step repeats the step of generating the new partial Reduce result while gradually increasing the threshold of the partial update frequency.

The method according to claim 1, further comprising:
identifying, in response to at least one of the plurality of data being updated, the group containing the updated data; and
invalidating the cache of the partial Reduce result corresponding to the identified group.

The method according to claim 1, further comprising:
changing, in response to the data update frequency of at least one of the plurality of data having changed significantly compared with when the plurality of data were divided into the plurality of groups, the group to which the data whose data update frequency has changed significantly belongs; and
invalidating the cache of the partial Reduce results corresponding to the group to which that data belonged before the change and the group to which it belongs after the change.

The method according to claim 1, further comprising:
creating an index of the cache locations of the partial Reduce results; and
storing the created index in any one of the computers constituting the Map-Reduce processing system.

The method according to claim 1, wherein the step of generating the Map results generates the plurality of Map results by executing the Map processing on each of the plurality of groups whose group update frequency exceeds the threshold of the group update frequency.

The method of claim 1, further comprising storing the plurality of Map results in a cache.

A method for constructing a cache mechanism for Reduce processing in a Map-Reduce processing system that executes Map processing and Reduce processing to process a plurality of data in a distributed manner,
the Map-Reduce processing system being a Map-Reduce-ReduceMerge processing system in which the Reduce processing is executed partially on the results of the Map processing and ReduceMerge processing is added to process the partially processed results in stages, the method comprising:
dividing the plurality of data into a plurality of groups based on a data update frequency, which is the update frequency of each of the plurality of data;
calculating a group update frequency, which is the update frequency of each of the plurality of groups, based on the data update frequency of the data constituting each of the plurality of groups;
generating a plurality of Map results by executing the Map processing on each of the plurality of groups whose group update frequency is equal to or less than a preset threshold of the group update frequency;
executing the Reduce processing partially on the plurality of Map results to generate a plurality of partial Reduce results;
executing the ReduceMerge processing in stages on the plurality of partial Reduce results to generate a new partial Reduce result;
calculating a partial update frequency, which is the update frequency of the partial Reduce result, based on the group update frequency of the group corresponding to the partial Reduce result; and
storing, in a cache, the partial Reduce result, information identifying the group corresponding to the partial Reduce result, and the update frequency of the partial Reduce result;
the method further comprising,
when the partial Reduce result, the information identifying the group corresponding to the partial Reduce result, and the update frequency of the partial Reduce result are stored in the cache:
identifying, in response to at least one of the plurality of data being updated, the group containing the updated data;
invalidating the cache of the partial Reduce result corresponding to the identified group;
changing, in response to the data update frequency of at least one of the plurality of data having changed significantly compared with when the plurality of data were divided into the plurality of groups, the group to which the data whose data update frequency has changed significantly belongs; and
invalidating the cache of the partial Reduce results corresponding to the group to which that data belonged before the change and the group to which it belongs after the change.

A program for constructing a cache mechanism for Reduce processing in a Map-Reduce processing system that executes Map processing and Reduce processing to process a plurality of data in a distributed manner,
the Map-Reduce processing system being a Map-Reduce-ReduceMerge processing system in which the Reduce processing is executed partially on the results of the Map processing and ReduceMerge processing is added to process the partially processed results in stages,
the program causing a computer constituting the Map-Reduce-ReduceMerge processing system to execute:
dividing the plurality of data into a plurality of groups based on a data update frequency, which is the update frequency of each of the plurality of data;
calculating a group update frequency, which is the update frequency of each of the plurality of groups, based on the data update frequency of the data constituting each of the plurality of groups;
generating a plurality of Map results by executing the Map processing on each of the plurality of groups whose group update frequency is equal to or less than a preset threshold of the group update frequency;
executing the Reduce processing partially on the plurality of Map results to generate a plurality of partial Reduce results;
executing the ReduceMerge processing in stages on the plurality of partial Reduce results to generate a new partial Reduce result; and
storing the partial Reduce result in a cache.

A system for constructing a cache mechanism for Reduce processing in a Map-Reduce processing system that executes Map processing and Reduce processing to process a plurality of data in a distributed manner,
the Map-Reduce processing system being a Map-Reduce-ReduceMerge processing system in which the Reduce processing is executed partially on the results of the Map processing and ReduceMerge processing is added to process the partially processed results in stages, the system comprising:
classifying means for dividing the plurality of data into a plurality of groups based on a data update frequency, which is the update frequency of each of the plurality of data;
frequency calculating means for calculating a group update frequency, which is the update frequency of each of the plurality of groups, based on the data update frequency of the data constituting each of the plurality of groups;
Map processing means for generating a plurality of Map results by executing the Map processing on each of the plurality of groups whose group update frequency is equal to or less than a preset threshold of the group update frequency;
Reduce processing means for executing the Reduce processing partially on the plurality of Map results to generate a plurality of partial Reduce results;
ReduceMerge processing means for executing the ReduceMerge processing in stages on the plurality of partial Reduce results to generate a new partial Reduce result; and
cache means for storing the partial Reduce result in a cache.