I had heard about Hadoop many times but never dug into it. Now is my chance, so I started experimenting in Microsoft Azure.
First, you can find a number of Hadoop packages in the Azure Marketplace. I chose Hortonworks Sandbox with HDP 2.5. I tried Hadoop by Bitnami as well, but its usability is a bit tricky: I couldn’t find a way to make Bitnami work without creating a number of accounts and exposing more of my own information. I may try it later (and enable boot diagnostics to find the password in the log when the image starts for the first time) when I have time. For now, I’m sticking with Hortonworks.
Then just follow the standard Azure procedure –
- Basics: fill in the VM name, username, SSH key or password, subscription, resource group, location, etc.
- Size: choose the size of the VM.
- Settings: choose the storage, network, etc. I suggest leaving boot diagnostics enabled.
- Confirm the Summary and click Buy.
Note that on the pricing page there is a warning about charges beyond the Azure VM itself, but since the HDP Sandbox showed 0.0000 CAD/hr, I don’t think you need to worry too much about it. By the way, Bitnami’s Hadoop is also free, as explicitly mentioned.
Wait a few minutes until the deployment succeeds. You can then check the status of your new Hadoop VM. Hortonworks suggests that you make the public IP static. You can find more detailed information on its tutorial page.
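If you prefer the command line to clicking through the portal, making the public IP static can be sketched with the Azure CLI. The resource group and IP resource names below are placeholders, not values from this setup:

```shell
# Sketch: make the VM's public IP static via the Azure CLI.
# "myResourceGroup" and "myPublicIp" are placeholders for your own
# resource group and public IP resource names.
az network public-ip update \
  --resource-group myResourceGroup \
  --name myPublicIp \
  --allocation-method Static
```

The same change can be made in the portal under the public IP resource’s Configuration blade.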
Next, configure your SSH client. I am using PuTTY on Windows, so there are more mouse clicks than in the command-line example given in the tutorial. Basically, these settings let you connect from various ports on localhost to the remote VM in the Azure cloud via the SSH tunnel you set up here.
Here is how to configure PuTTY:
- Fill in the public IP of your Hadoop VM
- Expand Connection – SSH
- Click Tunnels
- Fill in the source and destination, then click the Add button
According to the tutorial, you need to add 8 forwarded ports.
So in PuTTY, add them one by one; it should eventually look like this (scroll up and down to see all 8 lines/ports).
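If you are on macOS, Linux, or using OpenSSH on Windows, the same tunnels can be sketched as a single ssh command instead of PuTTY clicks. The username, IP placeholder, and ports below are assumptions (typical HDP Sandbox services); use your own login and the eight ports the tutorial lists:

```shell
# Sketch of an equivalent OpenSSH invocation. The username and IP are
# placeholders, and the ports are assumptions based on common HDP
# Sandbox services -- substitute the eight ports from the tutorial:
#   8080 - Ambari web UI
#   8888 - sandbox welcome page
#   9995 - Zeppelin notebook
#   4200 - web shell
ssh azureuser@<your-vm-public-ip> \
  -L 8080:localhost:8080 \
  -L 8888:localhost:8888 \
  -L 9995:localhost:9995 \
  -L 4200:localhost:4200
```

Each `-L` option does exactly what one PuTTY tunnel entry does: it forwards a local port to a destination as seen from the VM.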
You can then go back to the Session page, enter a name under “Saved Sessions”, and save the configuration. Next time, you only need to load it from there.
One catch is that the VM needs some time to start and become stable. My first few login attempts failed; only after 20 or 30 minutes could I eventually log in, so be patient. After logging in, you should be able to see the following directories.
Then, following the tutorial, keep the SSH session active, and you can use a browser to visit this page on your VM.
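With the SSH session (and its tunnels) active, a quick way to confirm that forwarding works before opening a browser is to request the page from localhost. The port here is an assumption; use whichever local port you mapped to the sandbox’s web page:

```shell
# Quick tunnel check (assumed local port 8888 -- use the port you
# mapped in PuTTY). Prints the HTTP status code; 200 means the
# forwarded page is reachable.
curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8888/
```

If this hangs or fails, the tunnel is not up or the VM is still starting.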
Click the icon on the left, and you will see the dashboard.
Click the one on the right, and you can read more advanced topics, including the default username and password and how to change them.
That is the first step into Hadoop.