
AWS Elastic Map Reduce Quick Start – Dashboard

This post provides essential instructions on how to get started with Amazon Elastic MapReduce (Amazon EMR). You will learn how to create a sample Amazon EMR cluster by using the AWS Management Console. You then run a Hive script to process data stored in Amazon S3.

The instructions in this example do not apply to production environments, and they do not cover configuration options in depth. The example shows how to quickly set up a cluster for evaluation purposes. For questions or issues, you can reach out to the Amazon EMR team by posting on the Discussion Forum.

Cost

The sample cluster that you create runs in a live environment, and you are charged for the resources used. This example should take an hour or less, so the charges should be minimal. After you complete this example, you should reset your environment to avoid incurring further charges. For more information, see Reset EMR Environment.

Pricing for Amazon EMR varies by region and service. For this example, charges accrue for the Amazon EMR cluster and Amazon Simple Storage Service (Amazon S3) storage of the log data and output from the Hive job. If you are within your first year of using AWS, some or all of your charges for Amazon S3 might be waived if you are within your usage limits of the AWS Free Tier.
For more information about Amazon EMR pricing and the AWS Free Tier, go to Amazon EMR Pricing and AWS Free Tier.

You can use the Amazon Web Services Simple Monthly Calculator to estimate your bill.

Sample EMR Cluster Prerequisites

The following are the preliminary steps you must perform to complete the example.

  1. Create an AWS account.
  2. Create an S3 bucket.
    The example in this topic uses an S3 bucket to store log files and output data.
    Due to Hadoop constraints, the bucket name should conform to these requirements:

    • It must contain only lowercase letters, numbers, periods, and hyphens.
    • It cannot end with a number.
      Example: mycompany.username.vernumber-emr-quickstart.
  3. Click on the S3 bucket name. The bucket page is displayed.
  4. Create two folders named logs and output.
    Make sure that the output folder is empty. For more information, see Creating a Folder.
  5. Create an Amazon EC2 Key Pair.
    You need the key pair to connect to the nodes in the cluster.
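
The two bucket-name requirements listed in step 2 can be checked before you create the bucket. The following is a minimal sketch in Python; the function name is hypothetical, and it checks only the two rules stated in this post, not the general Amazon S3 naming rules (such as length limits).

```python
import re

# Only the two constraints stated in this post are encoded here:
#   1. lowercase letters, numbers, periods and hyphens only
#   2. the name must not end with a number
_ALLOWED_CHARS = re.compile(r"[a-z0-9.-]+")


def is_emr_friendly_bucket_name(name: str) -> bool:
    """Return True if `name` satisfies both rules from the prerequisites."""
    if not name or not _ALLOWED_CHARS.fullmatch(name):
        return False
    # Rule 2: the last character must not be a digit.
    return not name[-1].isdigit()


if __name__ == "__main__":
    for candidate in (
        "mycompany.username.vernumber-emr-quickstart",  # example from this post
        "MyCompany-emr",                                # uppercase letters
        "emr-quickstart-2",                             # ends with a number
    ):
        print(candidate, is_emr_friendly_bucket_name(candidate))
```

A name that passes this check can still be rejected by Amazon S3 for other reasons, for example because another account already uses it.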

Launch the Sample Amazon EMR Cluster

  1. In your browser, navigate to the AWS Management Console.
  2. In the Analytics section, click EMR. The console dashboard is displayed.
  3. Click the Create cluster button.
    The Create Cluster – Quick Options page is displayed.
    For more information, see Using Quick Cluster Configuration Overview.
  4. Accept the default values except for the following fields:
    • In the Cluster name box, enter any name that has meaning to you.
    • For the S3 folder box, click on the folder icon to select the path to the logs folder that you created.
    • For the EC2 key pair box, from the drop-down list, choose the key pair that you created.
  5. Click the Create cluster button.
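
The console steps above can also be approximated from a script. The following is a hedged sketch using the `run_job_flow` call of the boto3 EMR client; it is not what the console does internally. The release label, instance type, instance count, and region are illustrative assumptions rather than the console defaults, and the two role names assume that the default EMR roles already exist in your account.

```python
def build_quickstart_request(cluster_name, bucket, key_pair):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call."""
    return {
        "Name": cluster_name,
        # Points at the logs folder created in the prerequisites.
        "LogUri": f"s3://{bucket}/logs/",
        "ReleaseLabel": "emr-5.36.0",          # assumption: pick a current release
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption
            "SlaveInstanceType": "m5.xlarge",   # assumption
            "InstanceCount": 3,                 # assumption
            "Ec2KeyName": key_pair,
            # Keep the cluster running so that steps can be added later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default EMR roles exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request; requires boto3 and configured AWS credentials."""
    import boto3  # third-party package, imported here so the builder works without it

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

Calling `launch_cluster(build_quickstart_request("my-cluster", "<your-bucket>", "<your-key-pair>"))` starts a billable cluster, so the cost notes in this post apply in the same way.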

AWS Elastic Map Reduce (EMR)

Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Amazon EMR simplifies big data processing by providing a managed Hadoop framework that distributes and processes data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark and Presto (SQL Query Engine) in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. For a quick overview, see Introduction to Amazon Elastic MapReduce.

Background

Amazon EMR enables you to quickly and easily provision as much computing capacity as you need, and to add or remove capacity at any time. This is very important when dealing with variable or unpredictable processing requirements, as is often the case with big data processing.
For example, if the bulk of your processing occurs at night, you might need 100 virtual machine instances during the day and 500 instances at night. Or you might need a significant computing peak for a short period of time. With Amazon EMR you can quickly provision hundreds or thousands of instances and release them when the work is completed, saving on the overall cost.
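
To make the saving in the day and night example concrete, here is a small worked calculation in Python. It assumes a 12-hour day and a 12-hour night, which the example does not state, and it compares instance-hours only, because actual prices vary by region and instance type.

```python
def instance_hours(day_instances, night_instances, day_hours=12, night_hours=12):
    """Total instance-hours per 24-hour cycle (the 12/12 split is an assumption)."""
    return day_instances * day_hours + night_instances * night_hours


# Elastic: 100 instances during the day, 500 at night.
elastic = instance_hours(100, 500)   # 100*12 + 500*12 = 7200
# Fixed: sized for the nightly peak around the clock.
fixed = instance_hours(500, 500)     # 500*24 = 12000

saving = 1 - elastic / fixed
print(f"elastic={elastic} fixed={fixed} saving={saving:.0%}")
```

Under these assumptions, scaling with the workload uses 7,200 instance-hours per day instead of 12,000, which is 40% less.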

Computing Capacity

The following are some possible ways to control computing capacity: