Wednesday, 20 April 2016

Installing and working with ELK Stack - Part 1 (Environment Setup)

I recently got to work with the ELK Stack. It can be considered an open-source alternative to Splunk (which, by the way, is pretty impressive in itself if you have a good amount of extra cash on hand). The ELK Stack is used for better analysis and visualization of your log files.


ELK Stack infrastructure (image courtesy of Mitchell Anicas's tutorial on the ELK Stack)


In Part 1 of this tutorial, I'll go over the environment setup for the ELK Stack. The stack comprises the following three technologies.

Elasticsearch (E): The amazing ElasticSearch datastore server. It's a search server based on Lucene which provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. For more information please visit the ElasticSearch Wiki.

Logstash (L): It is an open source tool for collecting, parsing, and storing logs for future use. For more information please visit LogStash Wiki.

Kibana (K): It's an open-source data visualization plugin for ElasticSearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. For more information please visit the Kibana Wiki.

Question: If we already have the data in the logs, then why do we need this?

Answer: That's a very good question. Although you gather a lot of logs (syslogs, user logs, etc.), it's still very difficult to go through them and produce any scheduled analysis or report. Here's where the ELK Stack helps you: you can load your logs into ElasticSearch using LogStash and then use Kibana to query the data and generate report dashboards. Pretty awesome!


Pre-requisites


Minimum System Requirements (The More The Merrier!)

  • Ubuntu 14.04 LTS
  • 1 GB RAM
  • 50 GB HDD
  • 2 CPUs


Preliminary Steps

Let's begin with some groundwork that will help us later.


#update and upgrade all installed packages
$ sudo apt-get update && sudo apt-get upgrade

#install java
$ sudo apt-get install openjdk-7-jre

#create a "Downloads" folder so as to keep all downloads at one place
$ cd ~
$ mkdir Downloads
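
Both Elasticsearch and Logstash run on the JVM, so it is worth confirming the Java install before moving on:

$ java -version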



Installing ElasticSearch


Although it can be installed using a package manager, I still like to do it the old-school way, i.e., by downloading the debian package and installing it manually.


$ cd ~/Downloads

#Download and unpack ElasticSearch. Latest release at the time of this writing is v2.3.1
$ wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.3.1/elasticsearch-2.3.1.deb
$ sudo dpkg -i elasticsearch-2.3.1.deb

#enable elastic search to start automatically at boot
$ sudo update-rc.d elasticsearch defaults 95 10

#start service
$ sudo service elasticsearch start
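
To make sure Elasticsearch actually came up (it can take a few seconds), you can hit its HTTP interface; it should answer with a small JSON document containing the node name and version details:

$ curl http://localhost:9200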




Installing LogStash


Now let's install LogStash in the same way, by downloading and installing the debian package.


$ cd ~/Downloads

#Download and unpack LogStash. Latest release at the time of this writing is v2.3.1
$ wget https://download.elastic.co/logstash/logstash/packages/debian/logstash_2.3.1-1_all.deb
$ sudo dpkg -i logstash_2.3.1-1_all.deb




Testing LogStash


After installation it's a good idea to verify the LogStash install. We can do that as follows.


$ cd ~
$ /opt/logstash/bin/logstash agent -e "input {stdin { } } output { stdout { codec => rubydebug } }"


Type Hello, Logstash! and press Enter.
You should see output somewhat like this:


{
"message" => "Hello, Logstash!",
"@version" => "1",
"@timestamp" => "2014-07-28T01:27:27.231Z",
"host" => "elkstack"
}


You can exit LogStash using CTRL+D
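
Configuration files are the topic of Part 2, but just to give a flavour, a minimal pipeline config looks roughly like the sketch below; the log path is only a placeholder, and with the debian package such files usually go under /etc/logstash/conf.d/.

input {
  file {
    path => "/var/log/syslog"    # placeholder: any log file you want to ship
    start_position => "beginning"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]    # the Elasticsearch instance installed above
  }
}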



Installing Kibana



Now let's install the final part of the stack, i.e., Kibana.


$ cd ~/Downloads

#Download and install the Kibana debian package. Latest release at the time of this writing is v4.5
$ wget https://download.elastic.co/kibana/kibana/kibana_4.5.0_amd64.deb
$ sudo dpkg -i kibana_4.5.0_amd64.deb


We need to configure the server host in the Kibana settings. For this, open Kibana's YAML config file, uncomment the server.host line, and set its value to "localhost".


$ sudo editor /opt/kibana/config/kibana.yml
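
For reference, after the change the relevant line in kibana.yml should look roughly like this (the default port, 5601, is left unchanged; Nginx will proxy to it later):

# in /opt/kibana/config/kibana.yml
server.host: "localhost"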


Also, let's allow Kibana to start automatically at boot:


$ sudo update-rc.d kibana defaults 96 9
$ sudo service kibana start #start kibana service




Installing Nginx


Let's also install Nginx so that we can access our Kibana dashboard from a browser; it will act as a simple reverse proxy with basic authentication in front of Kibana.


#install nginx
$ sudo apt-get install nginx apache2-utils

#create a new user in htpasswd for kibana login
$ sudo htpasswd -c /etc/nginx/htpasswd.users <YourUsername>

#edit the nginx config file
$ sudo mv /etc/nginx/sites-available/default /etc/nginx/sites-available/default.old #create backup
$ sudo editor /etc/nginx/sites-available/default #create new default


Add the following configuration to the "default" file and save it:


server {
  #remember to check that no other service is running on port 80. Apache (if installed) should be stopped.
  listen 80;
  server_name <YourServerName>;

  auth_basic "Restricted Access";
  auth_basic_user_file /etc/nginx/htpasswd.users; #use the htpasswd.users to match the credentials

  location / {
    proxy_pass http://localhost:5601;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection 'upgrade';
    proxy_set_header Host $host;
    proxy_cache_bypass $http_upgrade;
  }
}
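
Before restarting, you can validate the configuration syntax with Nginx's built-in check:

$ sudo nginx -t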

#finally, let's restart the Nginx service
$ sudo service nginx restart


If all works well, you should be able to access your Kibana dashboard through your browser. Just browse to the host name of your machine and enter the username and password you created above.

This completes Part 1 of the ELK Stack tutorial, i.e., the setup part.

Coming Soon: 
  • Part 2, in which I'll discuss how to work with the ELK Stack and load data into ElasticSearch from a remote database using a custom LogStash config. I'll also show how to work with Kibana dashboards to create simple visualizations.
  • YouTube video tutorials for both parts.
Enjoy analyzing your logs!!!

Stay tuned by subscribing to this blog.

Monday, 2 March 2015

Data Analysis using Twitter API and Python

As the title suggests, I'll be working here with the Twitter Search API to get some tweets based on a search parameter and try to extract some useful information from the data received.

For the purpose of working with the Twitter API, I'll be using the "tweepy" library in Python (http://www.tweepy.org/), which makes it a lot easier to access and use the Twitter API.

I have added comments throughout the code (wherever required) to make it easier to understand. I'll be using the "pandas" library in Python (http://pandas.pydata.org/) to convert the returned JSON into a DataFrame so we can work on it and analyze it.

I am using the IPython Notebook Server on AWS, which I created in my last post.

So let's begin...


Process

1. The first step is to import the required libraries. We'll need three main libraries: "tweepy", "pandas", and "matplotlib.pyplot".
"tweepy" is used to access the Twitter API; "pandas" is used to analyze the data received; and "pyplot" is used to plot the analyzed data so that it is easier to understand.


# Import the required libraries.
import tweepy
import pandas as pd
import matplotlib.pyplot as plt


2. I'll also use a short line of code to make the graphs/plots look a bit prettier than normal.


# Make the graphs prettier
pd.set_option('display.mpl_style', 'default')


3. Now it's time to go to apps.twitter.com, create a new app, and get the consumer_key and consumer_secret from that app. You can search online if you have any difficulty doing so, but it is a very straightforward process.
We will assign these keys to variables and connect to the Twitter API with them using "tweepy".


consumerKey = '<Your Consumer Key>'
consumerSecret = '<Your Consumer Secret Key>'

#Use tweepy.OAuthHandler to create an authentication using the given key and secret
auth = tweepy.OAuthHandler(consumer_key=consumerKey,
                           consumer_secret=consumerSecret)

#Connect to the Twitter API using the authentication
api = tweepy.API(auth)


4. Now we will perform a Search operation on the Twitter feed to get some data. For the purpose of this tutorial, I am going to search for "#Oscars2015" as it was a most recent and very popular event which many people tweeted about throughout the World.


#Perform a basic search query where we search for '#Oscars2015' in the tweets
result = api.search(q='%23Oscars2015') #%23 is the URL-encoded form of '#'

# Print the number of items returned by the search query to verify that it ran.
# It is 15 by default.
len(result)



5. The next step is to study the data in one tweet, to see its structure and check what information can be used for analysis.


tweet = result[0] #Get the first tweet in the result

# Analyze the data in one tweet to see what we require
for param in dir(tweet):
    #Attribute names beginning with '_' are internal ones and usually
    #not required, so we'll skip them
    if not param.startswith("_"):
        print "%s : %s\n" % (param, getattr(tweet, param))



6. I have not displayed the output above, as there is a lot of information returned for one tweet and I want to save space. It's pretty easy to understand though.
Each tweet is in JSON format, and there is a lot of information about the author of the tweet, the text of the tweet, the location, etc.
It's always a good idea to go through the data once to understand it and then try to utilize it for analysis. At least that is what I tend to do.
Now that we know what a tweet contains, we can fetch a good amount of data for analysis. For this I'll download the first 5000 tweets matching the same search query (#Oscars2015), as it was one of the trending topics on Twitter. I'll make use of the tweepy Cursor for this.


results = []

#Get the first 5000 items based on the search query
for tweet in tweepy.Cursor(api.search, q='%23Oscars2015').items(5000):
    results.append(tweet)

# Verify the number of items returned
print len(results)


Output of above:
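
A small practical note: pulling 5000 tweets can run into Twitter's search rate limits. Depending on your tweepy version, you can ask the client to pause automatically when that happens (the keyword arguments below are an assumption about your tweepy version, so check its documentation):

# Recreate the API client so it waits out rate limits instead of raising an error
# (assumes a tweepy version that supports these keyword arguments)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)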

7. Now I have my data, but as it is in JSON format it will be somewhat hard to use for analysis directly. We have to separate out only the data that is required for the analysis.
To do so, we'll use the pandas library. This is one of the handiest libraries for a Data Scientist/Analyst, as it helps in representing data and analyzing it in a sophisticated and user-friendly manner.


# Create a function to convert a given list of tweets into a Pandas DataFrame.
# The DataFrame will consist of only the values which I think might be
# useful for analysis...


def toDataFrame(tweets):

    DataSet = pd.DataFrame()

    DataSet['tweetID'] = [tweet.id for tweet in tweets]
    DataSet['tweetText'] = [tweet.text for tweet in tweets]
    DataSet['tweetRetweetCt'] = [tweet.retweet_count for tweet in tweets]
    DataSet['tweetFavoriteCt'] = [tweet.favorite_count for tweet in tweets]
    DataSet['tweetSource'] = [tweet.source for tweet in tweets]
    DataSet['tweetCreated'] = [tweet.created_at for tweet in tweets]

    DataSet['userID'] = [tweet.user.id for tweet in tweets]
    DataSet['userScreen'] = [tweet.user.screen_name for tweet in tweets]
    DataSet['userName'] = [tweet.user.name for tweet in tweets]
    DataSet['userCreateDt'] = [tweet.user.created_at for tweet in tweets]
    DataSet['userDesc'] = [tweet.user.description for tweet in tweets]
    DataSet['userFollowerCt'] = [tweet.user.followers_count for tweet in tweets]
    DataSet['userFriendsCt'] = [tweet.user.friends_count for tweet in tweets]
    DataSet['userLocation'] = [tweet.user.location for tweet in tweets]
    DataSet['userTimezone'] = [tweet.user.time_zone for tweet in tweets]

    return DataSet

#Pass the tweets list to the above function to create a DataFrame
DataSet = toDataFrame(results)



8. Let's verify our DataFrame by looking at the first 5 and last 2 records in it.


# Let's check the top 5 records in the Data Set
DataSet.head(5)






# Similarly let's check the last 2 records in the Data Set
DataSet.tail(2)





9. OK, now we have our data in an organized and easy-to-understand form. It's time to look at it and do some cleaning.
This step is required so that we only use valid records in our data and drop the invalid ones, as they carry no significance and could skew our results or analysis.
While going through this DataFrame, I saw that some values in the 'userTimezone' column are 'None'. I plan to use this column to see in which time zones people were most active in tweeting about the 2015 Oscars, so I first have to remove the records with 'None' in that column, as they are of no use to me.



# 'None' is treated as null here, so I'll remove all the records
# having 'None' in their 'userTimezone' column
DataSet = DataSet[DataSet.userTimezone.notnull()]

# Let's also check how many records are we left with now
len(DataSet)


10. We can see that after removing the records with 'None' values, I am left with 3376 records out of my initial 5000.
These records are valid ones and we can use them for our analysis.
The next step is to count the number of tweets in each time zone, which can be done with a single line of code (that is the beauty of "pandas").



# Count the number of tweets in each time zone and get the first 10
tzs = DataSet['userTimezone'].value_counts()[:10]
print tzs



11. Here I am only interested in the top 10 time zones, which is why the value_counts result above is sliced to the first 10.
Now I have the top 10 time zones from which the most tweets about Oscars 2015 were generated (based on the 3376 records remaining from the initial 5000 after cleaning). These top 10 are shown in table format above. But a table is not always the best way of representing data, so we'll plot it as a bar graph to make it more presentable and easier to understand.



# Create a bar-graph figure of the specified size
plt.rcParams['figure.figsize'] = (15, 5)

# Plot the Time Zone data as a bar-graph
tzs.plot(kind='bar')


# Assign labels and title to the graph to make it more presentable
plt.xlabel('Timezones')
plt.ylabel('Tweet Count')
plt.title('Top 10 Timezones tweeting about #Oscars2015')
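
Since we configured inline plotting for the notebook, the figure is rendered right below the cell. If you ever run this code outside the notebook, you would need to display or save the figure explicitly (standard matplotlib calls; the filename is just an example):

plt.show()  # open an interactive window with the plot
# plt.savefig('top10_timezones.png')  # or write the figure to a file instead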




Analysis

From the graph above, we can see that the largest number of tweets about the 2015 Oscars were generated from the Rome time zone (Central European Time), followed by Pacific Time (US & Canada) and then Eastern Time (US & Canada).

This is good to know, as we can see how active people in Europe are about the Oscars - even more than the time zone in which the ceremony actually took place, i.e., Pacific Time.

It is also kind of surprising to me, because I thought the Hollywood stars would still be posting tweets even while attending the ceremony, to keep the media and their fans updated. But it seems the people in Rome/Europe are more excited about the Oscars than Hollywood itself... ;)


Remarks

This was just an attempt at explaining how to use the Twitter Search API with Python and perform some data analysis. I just took "#Oscars2015" as an example, but you can try something different and more useful.

Feel free to drop any comments, suggestions, or topics you might want me to write a post on, and I'll try to respond asap.

Hope you guys enjoyed this short tutorial. Enjoy and Happy Tweeting!!

Monday, 16 February 2015

Running an iPython Notebook Server on AWS - EC2 Instance

Updates:
7th January, 2016 - changes made according to the new Anaconda distribution (v2.4.1), which contains Jupyter Notebook.
1st December, 2016 - changes made to add the keyfile in the notebook config.
Note: The update to the video tutorial is still in progress, so please don't refer to it for now. Once I have updated it, I'll remove this note.


I hope everyone is familiar with AWS (Amazon Web Services) and how to use iPython (now Jupyter) Notebooks. If you are not familiar with Jupyter Notebook and you work with Python, then you are definitely missing a very important tool in your work. Please go through this video, which is a short tutorial on the iPython (Jupyter) Notebook.

OK, to begin with, I'll list all the steps to create a Jupyter Notebook server on an EC2 instance. I have also created a YouTube video for this post, which you can check out here. (The video update is in progress; please don't refer to it for now.)

The reason for deploying a Jupyter Notebook server on AWS is to be able to access and work with all my notebooks from anywhere in the world, using just my browser.

Enough of talking, let's begin:

1. Log in to your Amazon Management Console. If you don't have an account yet, you can create one. You get one year of free access to some of the services, which you can check out at this link.

2. Create a new EC2 Instance with Ubuntu. If you are not familiar with how to create an EC2 instance, you can check out the video of this blog, in which I go through the steps from the beginning.

3. The important thing to remember while creating the instance is to assign the security group settings as shown in the image below (in short, allow inbound SSH plus the port the notebook server will listen on, which we configure as 8888 later).


4. Launch your instance and ssh into it to perform the following operations:
  • First of all, we are going to use the Anaconda Python distribution to install all the required Python libraries. It is a free distribution and we are going to use the Linux version of it. Remember to verify the latest version of the distribution on the site; this post is updated to reflect the latest Anaconda distribution, 2.4.1.
    $ wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-2.4.1-Linux-x86_64.sh
       
  • Next we will use bash to run this .sh file. You need to accept the license terms and set the installation directory for the Anaconda distribution. I use the default, which is "/home/ubuntu/anaconda2/". It also asks to add the Anaconda Python path to your .bashrc profile; you can accept or add it manually.
    $ bash Anaconda2-2.4.1-Linux-x86_64.sh
       
  • Now you need to check which python you are using, just to confirm whether it is the one from the Anaconda distribution. You can do this with:
    $ which python
       
    This will show which python your system is currently using. If it does not point to the ".../anaconda2/..." folder, you can use the following command to reload your .bashrc and pick up the correct python:
    $ source .bashrc
       
  • Open the iPython terminal to generate an encrypted password that we'll use for logging in to our notebook server. Remember to copy and save the output of this command, which will be an encrypted password, something like "sha1...".
    $ ipython
    In [1]:from IPython.lib import passwd
    In [2]:passwd()
    and exit out of the ipython terminal using the "exit" command. [I'm not going to use this password (shown in the pic below), so don't waste your time trying to copy and use it. ;)]
  • Now we're going to create the configuration profile for our Jupyter Notebook server
    $ jupyter notebook --generate-config
       
  • Next, we will create a self-signed certificate for accessing our notebooks over HTTPS:
    $ mkdir certs
    $ cd certs
    $ sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.key -out mycert.pem
       
    It will ask some questions; please answer them to the best of your knowledge, as some of them are required to successfully create the certificate.
  • It's time to change the config settings of our server
    $ cd ~/.jupyter/
    $ vi jupyter_notebook_config.py
       
    You will see a long list of configuration settings. You can go through them and uncomment the ones you like, but I know what I want, so I'll add the following settings to the top of the file and leave the rest commented as is.
    c = get_config()
    
    # Kernel config
    c.IPKernelApp.pylab = 'inline'  # if you want plotting support always in your notebook
    
    # Notebook config
    c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem' #location of your certificate file
    c.NotebookApp.keyfile = u'/home/ubuntu/certs/mycert.key' #location of your certificate key
    c.NotebookApp.ip = '*'
    c.NotebookApp.open_browser = False  #so that the ipython notebook does not open a browser by default
    c.NotebookApp.password = u'sha1:68c136a5b064...'  #the encrypted password we generated above
    # It is a good idea to put it on a known, fixed port
    c.NotebookApp.port = 8888
    




  • We are almost done. Now it's time to start our Jupyter notebook server. First, I'll create a new folder to store all my notebooks:
    $ cd ~
    $ mkdir Notebooks
    $ cd Notebooks
       
    and now I'll start my notebook server
    $ jupyter notebook
       




    5. And that is all. Now you can access your notebooks from anywhere through your browser. Just navigate to the DNS name or public IP of your instance, along with the port number. (By default, the browser adds "http" to the URL; please remember to change it to "https".)
    Your browser will ask you to trust the certificate; since we signed it ourselves, we know we can trust it. See the images below for reference:




    6. Log in using the password you encrypted earlier in the iPython terminal, and you are good to go.

    7. One thing to note is that if you close your ssh session to the instance, the notebook server will stop working. To keep it running even after you close the ssh session, you can start it with the following command:
    $ nohup jupyter notebook
      
    This runs the server process with no-hangup, so it keeps running even if you close the ssh session to the instance.

    8. Later, if you decide you want to stop this process, you have to find its PID. You can do so by going into your notebooks folder and using the command:
    $ lsof nohup.out
      
    which will list the PID of the running nohup process (if any).
    Then you can use the kill command to kill this process and stop your ipython notebook server.
    $ kill -9 "PID"
      
    replace "PID" with the ID of the process you get from the "lsof" command.

    So, that is all you need to run an iPython Notebook server on an AWS EC2 instance. Please leave your comments about this blog post and do remember to check out its video. (The video update is in progress; please don't refer to it for now.)

    Until next time... :)

    Wednesday, 13 November 2013

    Hadoop-2.2.0 Installation Steps for Single-Node Cluster (On Ubuntu 12.04)



    1.       Download and install VMware Player depending on your Host OS (32 bit or 64 bit)


    2.       Download the .iso image file of Ubuntu 12.04 LTS (32-bit or 64-bit depending on your requirements)


    3.       Install Ubuntu from the image in VMware. (For efficient use, configure the virtual machine to have at least 2GB (4GB preferred) of RAM and at least 2 processor cores.)


    Note: Install it using any user id and password you prefer to keep for your Ubuntu installation. We will create a separate user for Hadoop installation later.


    4.       After Ubuntu is installed, log in and go to User Accounts (top-right corner) to create a new user for Hadoop.


    5.       Click on “Unlock” and unlock the settings by entering your administrator password.


    6.        Then click on "+" at the bottom-left to add a new user. Set the user type to Administrator (I prefer this, but you can also select Standard), then set the username to "hduser" and create it.
    Note: After creating the account you may see it as disabled. Click on the dropdown where "Disabled" is written and select "Set Password" to set a password for this account, or select "Enable" to enable the account without a password.


    7.       Your account is set. Now log in to your new "hduser" account.


    8.       Open terminal window by pressing Ctrl + Alt + T


    9. Install openJDK using the following command
    $ sudo apt-get install openjdk-7-jdk

    10. Verify the java version installed
    $ java -version
    java version "1.7.0_25"
    OpenJDK Runtime Environment (IcedTea 2.3.12) (7u25-2.3.12-4ubuntu3)
    OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)


    11. Create a symlink from the default OpenJDK directory name to 'jdk' using the following commands (note the sudo, since /usr/lib/jvm is owned by root):
    $ cd /usr/lib/jvm
    $ sudo ln -s java-7-openjdk-amd64 jdk


    12. Install ssh server:
    $ sudo apt-get install openssh-client
    $ sudo apt-get install openssh-server


    13. Add hadoop group and user
    $ sudo addgroup hadoop
    $ sudo usermod -a -G hadoop hduser


    To verify that hduser has been added to the group hadoop use the command:
    $ groups hduser


    which will display the groups hduser is in.


    14. Configure SSH:
    $ ssh-keygen -t rsa -P ''
    ...
    Your identification has been saved in /home/hduser/.ssh/id_rsa
    Your public key has been saved in /home/hduser/.ssh/id_rsa.pub
    ...
    $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    $ ssh localhost


    15. Disable IPv6, because it creates problems with Hadoop. Run the following command:
    $ gksudo gedit /etc/sysctl.conf


    16. Add the following lines to the end of the file:
    # disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1


    Save and close the file. Then restart the system and login with hduser again.
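
    After the reboot, you can optionally confirm that IPv6 is disabled; this standard check should print 1:
    $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6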


    17. Download Hadoop - 2.2.0 from the following link to your Downloads folder


    18. Extract Hadoop to /usr/local, rename the folder, and make hduser the owner:
    $ cd Downloads
    $ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/local
    $ cd /usr/local
    $ sudo mv hadoop-2.2.0 hadoop
    $ sudo chown -R hduser:hadoop hadoop



    19. Open the .bashrc file to edit it:
    $ cd ~
    $ gksudo gedit .bashrc


    20. Add the following lines to the end of the file:
    #Hadoop variables
    export JAVA_HOME=/usr/lib/jvm/jdk/
    export HADOOP_INSTALL=/usr/local/hadoop
    export PATH=$PATH:$HADOOP_INSTALL/bin
    export PATH=$PATH:$HADOOP_INSTALL/sbin
    export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
    export HADOOP_COMMON_HOME=$HADOOP_INSTALL
    export HADOOP_HDFS_HOME=$HADOOP_INSTALL
    export YARN_HOME=$HADOOP_INSTALL
    #end of paste


    Save and close the file.


    21. Open hadoop-env.sh to edit it:
    $ gksudo gedit /usr/local/hadoop/etc/hadoop/hadoop-env.sh


    and modify the JAVA_HOME variable in the File:


    export JAVA_HOME=/usr/lib/jvm/jdk/


    Save and close the file.
    Restart the system and log in again.


    22. Verify the Hadoop Version installed using the following command in the terminal:


    $ hadoop version

    The output should be like:
    Hadoop 2.2.0
    Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
    Compiled by hortonmu on 2013-10-07T06:28Z
    Compiled with protoc 2.5.0
    From source with checksum 79e53ce7994d1628b240f09af91e1af4
    This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-2.2.0.jar


    This confirms that Hadoop is installed; we just have to configure it now.


    23. Run the following command:
    $ gksudo gedit /usr/local/hadoop/etc/hadoop/core-site.xml


    24. Add the following between the <configuration> ... </configuration> tags
    <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    </property>


    Then Save and close the file


    25. In the terminal, run:
    $ gksudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml

    and paste the following between the <configuration> … </configuration> tags:
    <property>
     <name>yarn.nodemanager.aux-services</name>
     <value>mapreduce_shuffle</value>
    </property>


    <property>
     <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
     <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>


    26. Run the following command:
    $ gksudo gedit /usr/local/hadoop/etc/hadoop/mapred-site.xml.template


    27. Add the following between the <configuration> ... </configuration> tags
    <property>
     <name>mapreduce.framework.name</name>
     <value>yarn</value>
    </property>



    28. Instead of saving the file directly, use Save As… and set the filename to mapred-site.xml. Verify that the file is saved to the /usr/local/hadoop/etc/hadoop/ directory.
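
    Alternatively, you can do this from the terminal by copying the template to the new name first and then adding the property to the copy instead of the template:
    $ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml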


    29. Type the following commands to create folders for the namenode and datanode:


    $ cd ~
    $ mkdir -p mydata/hdfs/namenode
    $ mkdir -p mydata/hdfs/datanode

      


    30. Run the following:
    $ gksudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml



    31. Add the following lines between the <configuration> … </configuration> tags
    <property>
     <name>dfs.replication</name>
     <value>1</value>
    </property>


    <property>
     <name>dfs.namenode.name.dir</name>
     <value>file:/home/hduser/mydata/hdfs/namenode</value>
    </property>


    <property>
     <name>dfs.datanode.data.dir</name>
     <value>file:/home/hduser/mydata/hdfs/datanode</value>
    </property>



    32. Format the namenode with HDFS:
    $ hdfs namenode -format



    33. Start Hadoop Services:


    $ start-dfs.sh
    ....
    $ start-yarn.sh
    ….



    34. Run the following command to verify that hadoop services are running
    $ jps


    If everything was successful, you should see the following services running:
    2583 DataNode
    2970 ResourceManager
    3461 Jps
    3177 NodeManager
    2361 NameNode
    2840 SecondaryNameNode
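
    When you are done working, the services can be stopped with the matching stop scripts that ship alongside the start scripts used above:
    $ stop-yarn.sh
    $ stop-dfs.sh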


    Acknowledgement & Reference:
    1. For testing the Hadoop file system by copying and deleting files from the local machine, please refer to the following link:



    2. There are not many good references available online for installing Hadoop 2.2.0 (I had to go through a lot of them to find a working one). I followed the steps given on the following webpage.


    The changes are mine, made to fix the errors I faced during the installation process.


    3. Check out this video if you wish to see how to configure Ubuntu on VMware. This video also explains how to install the previous (1.x) version of Hadoop

    Hadoop 1.x Video