“He has thirty men’s heft of grasp in the gripe of his hand, the bold-in-battle. Blessed God out of his mercy this man hath sent to Danes of the West, as I ween indeed, against horror of Grendel” - Beowulf
My Beowulf cluster contains two Raspberry Pi’s (Model B and Model B+), two 2004 Mac G4 iBooks loaded with Lubuntu, a Chromebook with crouton, a Intel i7 desktop with VirtualBox hosting Ubuntu, and a Intel Xeon virtual server with Ubuntu.
This program is a example of a Beowulf cluster or so-called Stone SouperComputer when hooked up with antiquated computers. My program uses Python and SSH and is very simple (< 200 lines). However, there are pretty good libraries like MPI out there that can be used instead for more intensive applications. When computers are in clusters they can do cool things, like carry out the great prime search, or solve the protein folding problem. My program is a simple way of doing something similar, using basically any computer that can run at least Lubuntu with very minimal coding.
How does it work? Basically, the server connects to any number of client computers via SSH and asks them to help compute some problem. Once a client finishes, the client sends back to the result to the server which stores the result on its own disk. The server then sends that client a new set of computations to finish, and this repeats until all the computations are finished. The client never has to store any information. The server is able to keep track of the client threads, the overall productivity and productivity of each client, and the entirety of the finished results.
The following information and code is readily available to download from my Github.
Beowulf clusters can be used for computation. My cluster is more of a Stone Soupercomputer - a cluster of antiquated computers - so some of the computers will be exceedingly slow.
The example I’ve included here is a computational-use - the computation of the first ten million primes. As an example of the disparity between computers when using a Beowulf cluster of old and new, here is a plot of the rate of prime searching for the computers in my cluster:
As you can see, you’d need about five 10-year-old computers to equivalate a single modern desktop. Its probably just better to buy a desktop if this is what you want. Modern computers are faster and more energy efficient, so using antiquated computers for computation probably is not a wise choice.
A Beowulf cluster is especially suited for scraping, since the bottleneck is mainly the connection time to the website. Thus, the web scraping with a Beowulf cluster will roughly scale with the number of computers (or cores per computers). If you do use this for a web scraper you might like to use Tor so you don’t get blocked - I’ve included a code block that utilizes Tor. Here is a plot of my results from the same computers I used above for scraping a website with and without Tor:
For this purpose, the slowest, older computers are only about two times slower than he fastest, newest computers. And by combining their power, you can basically recover the entirety of their might as shown by the actual sites per minute gathered by the cluster versus the sum of individual components:
Of course, you’ll also see that Tor will slow you down quite a bit if you choose to use that extra security boost.
Note, if you do use this as a web scraper, be sure to read the Terms of Service and make sure to follow
robots.txt as some sites do not allow web scraping (and be kind not to overload any servers with thousands of connections per second).
How it works
The server computer needs to have SSH access to each of the computers in the cluster using a SSH key (to alleviate having to store passwords in the program). You can setup SSH pretty easily by installing
openssh-client on each of the cluster computers. The cluster computers can be LAN or Public, as long as you have access to them via SSH and they can run Linux. Also make sure to install whatever Python libraries you’ll need on the corresponding computers, as this program doesn’t do that yet.
To generate a SSH key on the server computer, simply use
And just press enter/enter/enter.
Then copy the ssh key to all your connected computers using
And type in your password. That should be the only time you need to type in your password. Then, if the clients have the right python libraries, you can run the server program
python server_primes.py, to automatically make the directories and transfer the client program and run the client program.
The server sends a SSH command to run a command on the client computer. Do not send anything private to the client, because it is run as a command and might be saved in the client history. The client then runs the program and sends back the output as a JSON which I believe is secured through SSH.
I included the ability to add in Tor connections. The client script uses the first argument as a Tor flag. To use Tor you need to install tor
apt-get install tor as well as PySocks
pip install PySocks on the client computers.
Since Tor needs to run as a super user, you can create store your password on your client. Since its never a good idea to store the plaintext password, I suggest using base64. I.e. Type your password into
~/pass and then
The program is then set to use this password for running the Tor connections.