Computer cluster: Difference between revisions

A '''computer cluster''', or simply '''cluster''', is a computer system in which a group of loosely integrated [[计算机|computers]] cooperates closely, through [[软件|software]] and/or [[硬件|hardware]], to carry out computational work. In a sense, the group can be viewed as a single computer. The individual machines in a cluster are usually called nodes and are typically connected by a local area network, although other interconnects are also possible. Clusters are generally used to improve computing speed and/or reliability beyond that of a single computer, and they usually offer a far better price/performance ratio than a single machine such as a [[工作站|workstation]] or a supercomputer.


== Cluster classification ==
Clusters are either homogeneous or heterogeneous, depending on whether the computers that make up the cluster share the same architecture. By function and structure, computer clusters fall into the following categories:
* High-availability (HA) clusters
* Load-balancing clusters
* High-performance computing ([[超级计算机|HPC]]) clusters
* Grid computing


=== High-availability clusters ===
When a node in the cluster fails, the tasks running on it are automatically transferred to other healthy nodes. A node can also be taken offline for maintenance and later brought back online without affecting the operation of the cluster as a whole.


=== Load-balancing clusters ===
A load-balancing cluster distributes the workload through one or more front-end load balancers to a group of back-end servers, giving the system as a whole high performance and high availability. Such a cluster is sometimes also called a server farm. High-availability and load-balancing clusters generally use similar techniques, or combine the characteristics of both.


The [[LVS|Linux Virtual Server (LVS)]] project provides the most widely used load-balancing software for the Linux operating system.
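
As a rough sketch of the front-end dispatching described above, the following Python snippet assigns incoming requests to back-end servers in round-robin order. The back-end addresses are placeholders, and a real balancer such as LVS operates at the network layer inside the kernel rather than in application code like this.

```python
# Minimal round-robin dispatcher sketch (illustrative only; the backend
# addresses are placeholders, not part of the original article).
from itertools import cycle

BACKENDS = ["10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"]  # hypothetical pool
_next_backend = cycle(BACKENDS)

def dispatch(request_id: int) -> str:
    """Pick the next backend for a request, round-robin style."""
    backend = next(_next_backend)
    print(f"request {request_id} -> {backend}")
    return backend

if __name__ == "__main__":
    for i in range(6):   # six requests cycle twice over three backends
        dispatch(i)
```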


=== High-performance computing clusters ===
High-performance computing clusters increase computing power by distributing computational tasks across the different compute nodes of the cluster, so they are mainly used in scientific computing. Popular HPC setups run the Linux operating system together with other free software to perform parallel computation; this configuration is usually called a [[Beowulf集群|Beowulf cluster]]. Such clusters run programs written specifically to exploit the cluster's parallelism, typically built on dedicated runtime libraries such as the [[消息传递接口MPI|MPI]] library designed for scientific computing.


HPC clusters are particularly suited to jobs in which large amounts of data are exchanged between compute nodes during the computation, for example when the intermediate results of one node affect the results computed on other nodes.


=== Grid computing ===
Grid computing, or grid clusters, is a technology closely related to cluster computing. The main difference is that a grid connects groups of computers that do not necessarily trust each other, so it operates more like a computing utility than like a single computer. Grids also typically support more heterogeneous collections of computers than clusters do.


Grid computing is optimized for workloads that consist of many independent jobs which do not need to share data while they run. A grid mainly manages the allocation of jobs to the computers that execute them independently. Resources such as storage may be shared by all nodes, but the intermediate results of one job do not affect the progress of jobs running on other grid nodes.
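
A minimal sketch of this many-independent-jobs pattern, using a local process pool purely as a stand-in for grid nodes; the job function and job count are invented for illustration.

```python
# Independent jobs with no inter-job communication (illustrative sketch).
# A real grid would run each job on a different machine; a process pool
# on one machine is used here only to show the pattern.
from concurrent.futures import ProcessPoolExecutor

def run_job(job_id: int) -> int:
    """A self-contained job: no data is shared with other jobs."""
    return sum(i * i for i in range(job_id * 10_000))

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_job, range(8)))   # eight independent jobs
    print(results)
```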


== Characteristics of cluster technology ==
# Several computers work together on the same task, achieving higher efficiency.
# Two or more machines keep the same content and follow the same working procedure, so if one machine fails, another can take over.


== Cluster software ==
* [[Sun Grid Engine]]
* SLURM (the job scheduler of [[天河一號|Tianhe-1]])
* JBoss Application Server
* Lander Vault
* Solaris Cluster
* Oracle Real Application Clusters (RAC)
* DRBD + Heartbeat


== Cluster products ==


Revision as of 08:51, 7 February 2019

Technicians working on a large Linux cluster at the Chemnitz University of Technology, Germany
Sun Microsystems Solaris Cluster

A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.

The components of a cluster are usually connected to each other through fast local area networks, with each node (computer used as a server) running its own instance of an operating system. In most circumstances, all of the nodes use the same hardware[1][better source needed] and the same operating system, although in some setups (e.g. using Open Source Cluster Application Resources (OSCAR)), different operating systems can be used on each computer, or different hardware.[2]

Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.[3]

Computer clusters emerged as a result of convergence of a number of computing trends including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing.[citation needed] They have a wide range of applicability and deployment, ranging from small business clusters with a handful of nodes to some of the fastest supercomputers in the world such as IBM's Sequoia.[4] Prior to the advent of clusters, single unit fault tolerant mainframes with modular redundancy were employed; but the lower upfront cost of clusters, and increased speed of network fabric has favoured the adoption of clusters. In contrast to high-reliability mainframes, clusters are cheaper to scale out, but also have increased complexity in error handling, as in clusters error modes are not opaque to running programs.[5]

Basic concepts

A simple, home-built Beowulf cluster.

The desire to get more computing power and better reliability by orchestrating a number of low-cost commercial off-the-shelf computers has given rise to a variety of architectures and configurations.

The computer clustering approach usually (but not always) connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast local area network.[6] The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop the nodes and allows the users to treat the cluster as by and large one cohesive computing unit, e.g. via a single system image concept.[6]

Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer to peer or grid computing which also use many nodes, but with a far more distributed nature.[6]

A computer cluster may be a simple two-node system which just connects two personal computers, or may be a very fast supercomputer. A basic approach to building a cluster is that of a Beowulf cluster which may be built with a few personal computers to produce a cost-effective alternative to traditional high performance computing. An early project that showed the viability of the concept was the 133-node Stone Soupercomputer.[7] The developers used Linux, the Parallel Virtual Machine toolkit and the Message Passing Interface library to achieve high performance at a relatively low cost.[8]

Although a cluster may consist of just a few personal computers connected by a simple network, the cluster architecture may also be used to achieve very high levels of performance. The TOP500 organization's semiannual list of the 500 fastest supercomputers often includes many clusters, e.g. the world's fastest machine in 2011 was the K computer which has a distributed memory, cluster architecture.[9]

History

A VAX 11/780, c. 1977

Greg Pfister has stated that clusters were not invented by any specific vendor but by customers who could not fit all their work on one computer, or needed a backup.[10] Pfister estimates the date as some time in the 1960s. The formal engineering basis of cluster computing as a means of doing parallel work of any sort was arguably invented by Gene Amdahl of IBM, who in 1967 published what has come to be regarded as the seminal paper on parallel processing: Amdahl's Law.

The history of early computer clusters is more or less directly tied into the history of early networks, as one of the primary motivations for the development of a network was to link computing resources, creating a de facto computer cluster.

The first production system designed as a cluster was the Burroughs B5700 in the mid-1960s. This allowed up to four computers, each with either one or two processors, to be tightly coupled to a common disk storage subsystem in order to distribute the workload. Unlike standard multiprocessor systems, each computer could be restarted without disrupting overall operation.

The first commercial loosely coupled clustering product was Datapoint Corporation's "Attached Resource Computer" (ARC) system, developed in 1977, and using ARCnet as the cluster interface. Clustering per se did not really take off until Digital Equipment Corporation released their VAXcluster product in 1984 for the VAX/VMS operating system (now named as OpenVMS). The ARC and VAXcluster products not only supported parallel computing, but also shared file systems and peripheral devices. The idea was to provide the advantages of parallel processing, while maintaining data reliability and uniqueness. Two other noteworthy early commercial clusters were the Tandem Himalayan (a circa 1994 high-availability product) and the IBM S/390 Parallel Sysplex (also circa 1994, primarily for business use).

Within the same time frame, while computer clusters used parallelism outside the computer on a commodity network, supercomputers began to use them within the same computer. Following the success of the CDC 6600 in 1964, the Cray 1 was delivered in 1976, and introduced internal parallelism via vector processing.[11] While early supercomputers excluded clusters and relied on shared memory, in time some of the fastest supercomputers (e.g. the K computer) relied on cluster architectures.

Attributes of clusters

A load balancing cluster with two servers and N user stations (Galician).

Computer clusters may be configured for different purposes ranging from general purpose business needs such as web-service support, to computation-intensive scientific calculations. In either case, the cluster may use a high-availability approach. Note that the attributes described below are not exclusive and a "computer cluster" may also use a high-availability approach, etc.

"Load-balancing" clusters are configurations in which cluster-nodes share computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time will be optimized.[12] However, approaches to load-balancing may significantly differ among applications, e.g. a high-performance cluster used for scientific computations would balance load with different algorithms from a web-server cluster which may just use a simple round-robin method by assigning each new request to a different node.[12]

Computer clusters are used for computation-intensive purposes, rather than handling IO-oriented operations such as web service or databases.[13] For instance, a computer cluster might support computational simulations of vehicle crashes or weather. Very tightly coupled computer clusters are designed for work that may approach "supercomputing".

"High-availability clusters" (also known as failover clusters, or HA clusters) improve the availability of the cluster approach. They operate by having redundant nodes, which are then used to provide service when system components fail. HA cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure. There are commercial implementations of High-Availability clusters for many operating systems. The Linux-HA project is one commonly used free software HA package for the Linux operating system.

Benefits

Clusters are primarily designed with performance in mind, but installations are based on many other factors. Fault tolerance (the ability of a system to continue working with a malfunctioning node) allows for scalability, and in high-performance situations, low frequency of maintenance routines, resource consolidation (e.g. RAID), and centralized management. Advantages include enabling data recovery in the event of a disaster and providing parallel data processing and high processing capacity.[14][15]

In terms of scalability, clusters provide this in their ability to add nodes horizontally. This means that more computers may be added to the cluster, to improve its performance, redundancy and fault tolerance. This can be an inexpensive solution for a higher performing cluster compared to scaling up a single node in the cluster. This property of computer clusters can allow for larger computational loads to be executed by a larger number of lower performing computers.

When adding a new node to a cluster, reliability increases because the entire cluster does not need to be taken down. A single node can be taken down for maintenance, while the rest of the cluster takes on the load of that individual node.

Clustering a large number of computers lends itself to the use of distributed file systems and RAID, both of which can increase the reliability and speed of a cluster.

Design and configuration

A typical Beowulf configuration.

One of the issues in designing a cluster is how tightly coupled the individual nodes may be. For instance, a single computer job may require frequent communication among nodes: this implies that the cluster shares a dedicated network, is densely located, and probably has homogeneous nodes. The other extreme is where a computer job uses one or few nodes, and needs little or no inter-node communication, approaching grid computing.

In a Beowulf cluster, the application programs never see the computational nodes (also called slave computers) but only interact with the "Master" which is a specific computer handling the scheduling and management of the slaves.[13] In a typical implementation the Master has two network interfaces, one that communicates with the private Beowulf network for the slaves, the other for the general purpose network of the organization.[13] The slave computers typically have their own version of the same operating system, and local memory and disk space. However, the private slave network may also have a large and shared file server that stores global persistent data, accessed by the slaves as needed.[13]

A special purpose 144-node DEGIMA cluster is tuned to running astrophysical N-body simulations using the Multiple-Walk parallel treecode, rather than general purpose scientific computations.[16]

Due to the increasing computing power of each generation of game consoles, a novel use has emerged where they are repurposed into high-performance computing (HPC) clusters. Some examples of game console clusters are Sony PlayStation clusters and Microsoft Xbox clusters. Another example of a consumer game product is the Nvidia Tesla Personal Supercomputer workstation, which uses multiple graphics accelerator processor chips. Besides game consoles, high-end graphics cards can also be used instead. The use of graphics cards (or rather their GPUs) to do calculations for grid computing is vastly more economical than using CPUs, despite being less precise. However, when using double-precision values, they become as precise to work with as CPUs and are still much less costly (purchase cost).[2]

Computer clusters have historically run on separate physical computers with the same operating system. With the advent of virtualization, the cluster nodes may run on separate physical computers with different operating systems which are painted above with a virtual layer to look similar.[17][citation needed][clarification needed] The cluster may also be virtualized on various configurations as maintenance takes place. An example implementation is Xen as the virtualization manager with Linux-HA.[17]

Data sharing and communication

Data sharing

A NEC Nehalem cluster

As the computer clusters were appearing during the 1980s, so were supercomputers. One of the elements that distinguished the three classes at that time was that the early supercomputers relied on shared memory. To date clusters do not typically use physically shared memory, while many supercomputer architectures have also abandoned it.

However, the use of a clustered file system is essential in modern computer clusters.[citation needed] Examples include the IBM General Parallel File System, Microsoft's Cluster Shared Volumes or the Oracle Cluster File System.

Message passing and communication

Two widely used approaches for communication between cluster nodes are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).[18]

PVM was developed at the Oak Ridge National Laboratory around 1989 before MPI was available. PVM must be directly installed on every cluster node and provides a set of software libraries that paint the node as a "parallel virtual machine". PVM provides a run-time environment for message-passing, task and resource management, and fault notification. PVM can be used by user programs written in C, C++, or Fortran, etc.[18][19]

MPI emerged in the early 1990s out of discussions among 40 organizations. The initial effort was supported by ARPA and National Science Foundation. Rather than starting anew, the design of MPI drew on various features available in commercial systems of the time. The MPI specifications then gave rise to specific implementations. MPI implementations typically use TCP/IP and socket connections.[18] MPI is now a widely available communications model that enables parallel programs to be written in languages such as C, Fortran, Python, etc.[19] Thus, unlike PVM which provides a concrete implementation, MPI is a specification which has been implemented in systems such as MPICH and Open MPI.[19][20]
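
A minimal point-to-point example, under the assumption that the mpi4py Python binding and an MPI implementation such as MPICH or Open MPI are installed; it would be started with an MPI launcher, e.g. mpiexec -n 4 python mpi_hello.py (the script name is arbitrary).

```python
# Minimal MPI sketch using the mpi4py binding (assumption: mpi4py plus an
# MPI implementation such as MPICH or Open MPI are installed).
# Run with, for example:  mpiexec -n 4 python mpi_hello.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id within the communicator
size = comm.Get_size()   # total number of MPI processes

if rank == 0:
    # Rank 0 sends a small message to every other rank ...
    for dest in range(1, size):
        comm.send({"task": dest}, dest=dest, tag=0)
    print(f"rank 0 dispatched work to {size - 1} peers")
else:
    # ... and each peer receives and reports it.
    msg = comm.recv(source=0, tag=0)
    print(f"rank {rank} received {msg}")
```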

Cluster management

Low-cost and low energy tiny-cluster of Cubieboards, using Apache Hadoop on Lubuntu

One of the challenges in the use of a computer cluster is the cost of administrating it which can at times be as high as the cost of administrating N independent machines, if the cluster has N nodes.[21] In some cases this provides an advantage to shared memory architectures with lower administration costs.[21] This has also made virtual machines popular, due to the ease of administration.[21]

Task scheduling

When a large multi-user cluster needs to access very large amounts of data, task scheduling becomes a challenge. In a heterogeneous CPU-GPU cluster with a complex application environment, the performance of each job depends on the characteristics of the underlying cluster. Therefore, mapping tasks onto CPU cores and GPU devices provides significant challenges.[22] This is an area of ongoing research; algorithms that combine and extend MapReduce and Hadoop have been proposed and studied.[22]
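
As a generic illustration of the mapping problem (not the hybrid MapReduce/Hadoop schedulers studied in the cited work), the sketch below greedily assigns each task to whichever CPU or GPU slot becomes free first; the runtime estimates are invented for the example.

```python
# Generic greedy mapping of tasks onto CPU and GPU "slots" (illustration only).
import heapq

tasks = [  # (task name, est. seconds on a CPU core, est. seconds on a GPU)
    ("fft", 8.0, 1.5), ("parse", 2.0, 2.5), ("matmul", 12.0, 2.0), ("io", 1.0, 1.2),
]
devices = ["cpu0", "cpu1", "gpu0"]

# Min-heap of (time when the device becomes free, device name).
free_at = [(0.0, d) for d in devices]
heapq.heapify(free_at)

schedule = []
for name, cpu_s, gpu_s in tasks:
    t_free, dev = heapq.heappop(free_at)            # earliest available device
    runtime = gpu_s if dev.startswith("gpu") else cpu_s
    schedule.append((name, dev, t_free, t_free + runtime))
    heapq.heappush(free_at, (t_free + runtime, dev))

for name, dev, start, end in schedule:
    print(f"{name:7s} on {dev}: {start:5.1f}s -> {end:5.1f}s")
```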

Node failure management

When a node in a cluster fails, strategies such as "fencing" may be employed to keep the rest of the system operational.[23][better source needed][24] Fencing is the process of isolating a node or protecting shared resources when a node appears to be malfunctioning. There are two classes of fencing methods; one disables a node itself, and the other disallows access to resources such as shared disks.[23]

The STONITH method stands for "Shoot The Other Node In The Head", meaning that the suspected node is disabled or powered off. For instance, power fencing uses a power controller to turn off an inoperable node.[23]
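
A hedged sketch of power fencing, assuming the suspect node has an IPMI-capable management controller and that the ipmitool command is available; the controller address and credentials are placeholders.

```python
# Sketch of STONITH-style power fencing (assumptions: the node exposes IPMI
# and ipmitool is installed; BMC address and credentials are placeholders).
import subprocess

def power_fence(bmc_host: str, user: str, password: str) -> bool:
    """Power off a suspected-faulty node via its baseboard management controller."""
    cmd = ["ipmitool", "-H", bmc_host, "-U", user, "-P", password,
           "chassis", "power", "off"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    # Hypothetical BMC of the node that stopped responding to heartbeats.
    if power_fence("10.0.1.23", "admin", "secret"):
        print("node fenced: it can no longer touch shared resources")
```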

The resources fencing approach disallows access to resources without powering off the node. This may include persistent reservation fencing via SCSI-3, fibre channel fencing to disable the fibre channel port, or global network block device (GNBD) fencing to disable access to the GNBD server.

Software development and administration

Parallel programming

Load balancing clusters such as web servers use cluster architectures to support a large number of users and typically each user request is routed to a specific node, achieving task parallelism without multi-node cooperation, given that the main goal of the system is providing rapid user access to shared data. However, "computer clusters" which perform complex computations for a small number of users need to take advantage of the parallel processing capabilities of the cluster and partition "the same computation" among several nodes.[25]

Automatic parallelization of programs continues to remain a technical challenge, but parallel programming models can be used to effectuate a higher degree of parallelism via the simultaneous execution of separate portions of a program on different processors.[25][26]
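
A small sketch of partitioning "the same computation" across workers; local processes stand in for cluster nodes, and the problem size and chunking are arbitrary choices.

```python
# Partitioning one computation across worker processes (a local stand-in for
# cluster nodes; on a real cluster the chunks would go to different machines).
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

if __name__ == "__main__":
    n, workers = 10_000_000, 4
    step = n // workers
    chunks = [(i * step, (i + 1) * step if i < workers - 1 else n)
              for i in range(workers)]
    with Pool(workers) as pool:
        total = sum(pool.map(partial_sum, chunks))   # combine partial results
    print(total)
```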

Debugging and monitoring

The development and debugging of parallel programs on a cluster requires parallel language primitives as well as suitable tools such as those discussed by the High Performance Debugging Forum (HPDF) which resulted in the HPD specifications.[19][27] Tools such as TotalView were then developed to debug parallel implementations on computer clusters which use MPI or PVM for message passing.

The Berkeley NOW (Network of Workstations) system gathers cluster data and stores them in a database, while a system such as PARMON, developed in India, allows for the visual observation and management of large clusters.[19]

Application checkpointing can be used to restore a given state of the system when a node fails during a long multi-node computation.[28] This is essential in large clusters, given that as the number of nodes increases, so does the likelihood of node failure under heavy computational loads. Checkpointing can restore the system to a stable state so that processing can resume without having to recompute results.[28]
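
A minimal checkpointing sketch: the loop periodically pickles its state so that a restarted run resumes from the last checkpoint instead of recomputing everything; the file name and checkpoint interval are arbitrary choices.

```python
# Minimal application checkpointing sketch (illustrative; not a cluster
# checkpointing framework).
import os
import pickle

CHECKPOINT = "state.pkl"

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)   # resume after a crash or node failure
    return {"i": 0, "acc": 0}       # fresh start

def save_state(state):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump(state, f)

if __name__ == "__main__":
    state = load_state()
    for i in range(state["i"], 1_000_000):
        state["acc"] += i
        if i % 100_000 == 0:
            state["i"] = i + 1      # next iteration to run on resume
            save_state(state)
    print(state["acc"])
```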

Some implementations

The GNU/Linux world supports various cluster software; for application clustering, there are distcc and MPICH. Linux Virtual Server and Linux-HA are director-based clusters that allow incoming requests for services to be distributed across multiple cluster nodes. MOSIX, LinuxPMI, Kerrighed and OpenSSI are full-blown clusters integrated into the kernel that provide for automatic process migration among homogeneous nodes. OpenSSI, openMosix and Kerrighed are single-system image implementations.

Microsoft Windows Compute Cluster Server 2003, based on the Windows Server platform, provides pieces for high-performance computing like the Job Scheduler, the MSMPI library and management tools.

gLite is a set of middleware technologies created by the Enabling Grids for E-sciencE (EGEE) project.

Slurm is also used to schedule and manage some of the largest supercomputer clusters (see the TOP500 list).
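
A hedged example of submitting a job to such a scheduler from a script, assuming a Slurm installation whose sbatch command is on the PATH; the job name, resource counts and wrapped command are placeholders.

```python
# Submitting a batch job to Slurm from Python (assumes sbatch is installed;
# node/task counts and the wrapped command are placeholders).
import subprocess

cmd = [
    "sbatch",
    "--job-name=demo",
    "--nodes=2",          # request two nodes
    "--ntasks=8",         # eight tasks spread across them
    "--wrap", "srun hostname",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout.strip() or result.stderr.strip())  # e.g. "Submitted batch job 12345"
```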

Other approaches

Although most computer clusters are permanent fixtures, attempts at flash mob computing have been made to build short-lived clusters for specific computations. However, larger-scale volunteer computing systems such as BOINC-based systems have had more followers.

See also

Basic concepts

  • Clustered file system
  • Heartbeat private network
  • High-availability cluster
  • Single system image
  • Symmetric multiprocessing

Distributed computing

  • Distributed computing
  • Distributed data store
  • Distributed operating system
  • Distributed shared memory

Specific systems

  • DEGIMA (computer cluster)
  • K computer
  • Microsoft Cluster Server
  • Red Hat Cluster Suite
  • Rocks Cluster Distribution
  • Solaris Cluster
  • Veritas Cluster Server

Computer farms

  • Compile farm
  • Render farm
  • Server farm

References

  1. ^ Cluster vs grid computing. Stack Overflow. 
  2. ^ 2.0 2.1 Graham-Smith, Darien. Weekend Project: Build your own supercomputer. PC & Tech Authority. 29 June 2012 [2 June 2017]. 
  3. ^ Bader, David; Pennington, Robert. Cluster Computing: Applications. Georgia Tech College of Computing. May 2001 [2017-02-28]. 
  4. ^ Nuclear weapons supercomputer reclaims world speed record for US. The Telegraph. 18 Jun 2012 [18 Jun 2012]. 
  5. ^ Gray, Jim; Rueter, Andreas. Transaction processing : concepts and techniques. Morgan Kaufmann Publishers. 1993. ISBN 1558601902. 
  6. ^ 6.0 6.1 6.2 Network-Based Information Systems: First International Conference, NBIS 2007: 375. ISBN 3-540-74572-6. 
  7. ^ William W. Hargrove, Forrest M. Hoffman and Thomas Sterling. The Do-It-Yourself Supercomputer. Scientific American 265 (2). August 16, 2001: 72–79 [October 18, 2011]. 
  8. ^ Hargrove, William W.; Hoffman, Forrest M. Cluster Computing: Linux Taken to the Extreme. Linux Magazine. 1999 [October 18, 2011]. 
  9. ^ Yokokawa, Mitsuo; et al. The K computer: Japanese next-generation supercomputer development project. International Symposium on Low Power Electronics and Design (ISLPED): 371–372. 1–3 August 2011. doi:10.1109/ISLPED.2011.5993668. 
  10. ^ Pfister, Gregory. In Search of Clusters 2nd. Upper Saddle River, NJ: Prentice Hall PTR. 1998: 36. ISBN 0-13-899709-8. 
  11. ^ Hill, Mark Donald; Jouppi, Norman Paul; Sohi, Gurindar. Readings in computer architecture. 1999: 41–48. ISBN 978-1-55860-539-8. 
  12. ^ 12.0 12.1 Sloan, Joseph D. High Performance Linux Clusters. 2004. ISBN 0-596-00570-9. 
  13. ^ 13.0 13.1 13.2 13.3 Daydé, Michel; Dongarra, Jack. High Performance Computing for Computational Science - VECPAR 2004. 2005: 120–121. ISBN 3-540-25424-2. 
  14. ^ IBM Cluster System: Benefits. IBM. [8 September 2014]. (Archived from the original on 29 April 2016). 
  15. ^ Evaluating the Benefits of Clustering. Microsoft. 28 March 2003 [8 September 2014]. (Archived from the original on 22 April 2016). 
  16. ^ Hamada, Tsuyoshi; et al. A novel multiple-walk parallel algorithm for the Barnes–Hut treecode on GPUs – towards cost effective, high performance N-body simulation. Computer Science - Research and Development. 2009, 24: 21–31. doi:10.1007/s00450-009-0089-1. 
  17. ^ 17.0 17.1 Mauer, Ryan. Xen Virtualization and Linux Clustering, Part 1. Linux Journal. 12 Jan 2006 [2 Jun 2017]. 
  18. ^ 18.0 18.1 18.2 Milicchio, Franco; Gehrke, Wolfgang Alexander. Distributed services with OpenAFS: for enterprise and education. 2007: 339–341. ISBN 9783540366348. 
  19. ^ 19.0 19.1 19.2 19.3 19.4 Prabhu, C.S.R. Grid and Cluster Computing. 2008: 109–112. ISBN 8120334280. 
  20. ^ Gropp, William; Lusk, Ewing; Skjellum, Anthony. A High-Performance, Portable Implementation of the MPI Message Passing Interface. Parallel Computing. 1996. CiteSeerX 10.1.1.102.9485 (freely accessible). 
  21. ^ 21.0 21.1 21.2 Patterson, David A.; Hennessy, John L. Computer Organization and Design. 2011: 641–642. ISBN 0-12-374750-3. 
  22. ^ 22.0 22.1 K. Shirahata; et al. Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters. Cloud Computing Technology and Science (CloudCom): 733–740. 30 Nov – 3 Dec 2010. ISBN 978-1-4244-9405-7. doi:10.1109/CloudCom.2010.55. 
  23. ^ 23.0 23.1 23.2 Robertson, Alan. Resource fencing using STONITH (PDF). IBM Linux Research Center. 2010. 
  24. ^ Vargas, Enrique; Bianco, Joseph; Deeths, David. Sun Cluster environment: Sun Cluster 2.2. Prentice Hall Professional. 2001: 58. ISBN 9780130418708. 
  25. ^ 25.0 25.1 Aho, Alfred V.; Blum, Edward K. Computer Science: The Hardware, Software and Heart of It. 2011: 156–166. ISBN 1-4614-1167-X. 
  26. ^ Rauber, Thomas; Rünger, Gudula. Parallel Programming: For Multicore and Cluster Systems. 2010: 94–95. ISBN 3-642-04817-X. 
  27. ^ Francioni, Joan M.; Pancake, Cherri M. A Debugging Standard for High-performance computing. Scientific Programming (Amsterdam, Netherlands: IOS Press). April 2000, 8 (2): 95–108. ISSN 1058-9244. doi:10.1155/2000/971291. 
  28. ^ 28.0 28.1 Sloot, Peter (ed.). Computational Science-- ICCS 2003: International Conference: 291–292. 2003. ISBN 3-540-40195-4. 

Further reading

  • Baker, Mark; et al. Cluster Computing White Paper. 11 Jan 2001. arXiv:cs/0004014 (freely accessible). 
  • Marcus, Evan; Stern, Hal. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons. ISBN 0-471-35601-8. 
  • Pfister, Greg. In Search of Clusters. Prentice Hall. ISBN 0-13-899709-8. 
  • Buyya, Rajkumar (ed.). High Performance Cluster Computing: Architectures and Systems 1. NJ, USA: Prentice Hall. 1999. ISBN 0-13-013784-7. 
  • Buyya, Rajkumar (ed.). High Performance Cluster Computing: Architectures and Systems 2. NJ, USA: Prentice Hall. 1999. ISBN 0-13-013785-5. 

External links

  • IEEE Technical Committee on Scalable Computing (TCSC): https://www.ieeetcsc.org/
  • Reliable Scalable Cluster Technology, IBM: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.rsct.doc%2Frsctbooks.html
  • Tivoli System Automation Wiki: https://www.ibm.com/developerworks/wikis/display/tivoli/Tivoli+System+Automation
  • Large-scale cluster management at Google with Borg, April 2015, by Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune and John Wilkes: https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43438.pdf