Using GitHub in Eclipse Addendum

Create a Repository in the Project Directory

In the Package Explorer right click on the project name.
Select Team->Share Project. The Configure Git Repository dialog is displayed.
Check the box by Use or create repository in parent folder of project.
In the selection area, click on the project folder. The name is duplicated in the box below.
Click the Create Repository button.
Click Finish. The repository shows in the list of the local repositories view.
In the repositories view, select the repository just created. The unstaged project files are displayed in the Unstaged Changes view.

Configure Push Operation

In repository view, right-click on the Remotes node.
In the pop-up dialog window, click Create Remote. The New Remote dialog window is displayed.
Accept the origin default value. Click OK.
In the next dialog window, click the Change button.
Enter the URL of the remote GitHub location. The other values should be filled in automatically.
Click Finish.
Click Save.

Merge Local and Remote Repositories

In the Remotes node under origin, look for the green arrow pointing down; this is the Fetch arrow. Right-click on the Fetch entry and select Configure Fetch in the pop-up window. You should see the URI; make sure that it points to the remote repository.
Look in the Ref mappings section of the popup. It might be empty. You must indicate which remote references you want to fetch. Click Add.
Type in the branch name you need to fetch from the remote repository, usually master.
Continue through the wizard. Ignore the warning Remote tracking branch ‘refs/remotes/origin/master’ not found in local repository.
Click Finish.
In the last pop-up window, click Save and Fetch. This fetches the remote reference.
Click OK.

Look in the Branches folder of your local repository. You should now see the remote branch in the Remote Tracking folder. You should see something similar to the following:

You should have a list of unstaged files. Stage all the project files but one (we need a second staging later to push all the committed files).
Enter a comment such as “first commit”.
Click Commit. This puts the project under configuration control. In the Local folder, you should see something similar to the following:
Expand the Local folder of Branches, right click on the node named master.
Select Merge. The Merge wizard is displayed.
In the Merge wizard, under Remote Tracking folder, select the remote branch named origin/master.
Click Merge.
Follow the merge wizard steps.
Now stage the remaining file. Enter a comment such as “first commit”.
Click Commit and Push.
Go through the wizard steps.
Click Finish.
Wait for the push operation to be configured and then click the OK button.

Build GCP Drive Client

This post demonstrates how to build a Google Drive client application in Java. This command line client app shows the basic logic to interact with the Google Drive service and eliminates unnecessary clutter.

The application interacts with Google Drive through its Drive REST API using the Google Drive Java client library. For more information, go to Google API Client Libraries, then click on the Java link. In the menu bar click APIS, then press Ctrl-F and search for drive. You will get this:

Click on the version link (v3, in the example). This will take you to the Drive API Client Library for Java. Note that at the bottom of the page, in the section “Add Library to Your Project”, there are several tabs. If you click the Maven tab, you get the dependency in XML format to add to the pom.xml file in your project. This is an example:

    <dependency>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-drive</artifactId>
      <version>v3-rev82-1.22.0</version>
    </dependency>

The app uses a simple UI which allows the user to perform tasks such as: list the files in a project, upload files, download files and so on.

You can download the code at this location: gcp-drive-client. See also Import a Maven Project. Refer to the README file for the latest information about the example code.

Application Architecture

This section describes the components of the application and delegates the details to the actual code implementation.

Main. Gets authorization to access Google Drive service. Reads the default settings. Instantiates the command classes. Delegates to the SimpleUI class the display of the selection menu and the processing of the user’s input.
SimpleUI. Displays the menu of choices the user can select from. It processes the user’s input and calls the proper method based on the user’s selection. Each method calls the related Drive REST API.

FileOperations. Contains methods to perform Google Drive file operations.
The following example code shows how to list the files contained in the user’s account:

    /**
     * Retrieves and displays the list of the user's files.
     * authorizedDriveClient is the authorized Drive service created by Main.
     * @throws IOException if the list request cannot be created.
     */
    public static void listFiles() throws IOException {

        // Define a list to contain file objects.
        List<File> fileList = new ArrayList<File>();

        Files.List request = authorizedDriveClient.files().list();

        do {
            try {
                FileList files = request.execute();

                // With the v3 library shown above, the page content is read
                // with getFiles(); the older v2 library used getItems().
                fileList.addAll(files.getFiles());
                request.setPageToken(files.getNextPageToken());
            } catch (IOException e) {
                System.out.println("An error occurred: " + e);
                // Clear the token to stop paging after a failure.
                request.setPageToken(null);
            }
        } while (request.getPageToken() != null &&
                 request.getPageToken().length() > 0);

        // Display the information for all the files collected.
        Utilities.displayFilesInfo(fileList);
    }

Various utilities. Used to perform routine command tasks and housekeeping.
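
The paging pattern used by listFiles can be tried without a Drive account. The following is a minimal sketch, not the Drive API: Page, PageSource and inMemory are hypothetical stand-ins that mimic a request returning one page of items plus a token for the next page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of token-based paging; the types here are hypothetical, not Drive API types. */
final class PagingSketch {

    /** One page of results plus the token of the next page (null when there is none). */
    static final class Page {
        final List<String> items;
        final String nextPageToken;

        Page(List<String> items, String nextPageToken) {
            this.items = items;
            this.nextPageToken = nextPageToken;
        }
    }

    /** Returns the page identified by a token; a null token means the first page. */
    interface PageSource {
        Page fetch(String pageToken);
    }

    /** Collects every item by following next-page tokens until none is returned. */
    static List<String> listAll(PageSource source) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = source.fetch(token);
            all.addAll(page.items);
            token = page.nextPageToken;
        } while (token != null && !token.isEmpty());
        return all;
    }

    /** In-memory source that serves names in pages; the token is the next start index. */
    static PageSource inMemory(List<String> names, int pageSize) {
        return token -> {
            int start = (token == null) ? 0 : Integer.parseInt(token);
            int end = Math.min(start + pageSize, names.size());
            String next = (end < names.size()) ? String.valueOf(end) : null;
            return new Page(names.subList(start, end), next);
        };
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("a.txt", "b.txt", "c.txt", "d.txt", "e.txt");
        System.out.println(listAll(inMemory(names, 2)));
    }
}
```

The loop has the same shape as the do/while in listFiles: fetch a page, keep its items, and continue only while a non-empty token comes back.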

Application Workflow

The following figure shows the application time sequence (or workflow).

The first time you start the application, the Main class performs the following actions:

Initializes the default settings.
Creates authorized drive service.
Initializes the command classes
Initializes the SimpleUI class.
Starts the endless loop to process user inputs.

The SimpleUI class keeps processing user inputs, until the user enters the command to exit the loop. At that point, the application terminates.
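
The input loop described above can be sketched as follows. This is an illustration only, not the project's SimpleUI class: the class name MenuLoopSketch, the command keys and the exit key "x" are assumptions made for the example.

```java
import java.io.PrintStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

/** Sketch of a command loop; names and keys are hypothetical, not taken from the project. */
final class MenuLoopSketch {
    private final Map<String, Runnable> commands = new LinkedHashMap<>();
    private final PrintStream out;

    MenuLoopSketch(PrintStream out) {
        this.out = out;
    }

    /** Associates a menu key with the action to run when the user enters it. */
    void register(String key, Runnable action) {
        commands.put(key, action);
    }

    /** Processes one selection per line until "x" or end of input; returns the number of actions run. */
    int run(Scanner in) {
        int executed = 0;
        while (in.hasNextLine()) {
            String choice = in.nextLine().trim();
            if (choice.equals("x")) {
                break;
            }
            Runnable action = commands.get(choice);
            if (action == null) {
                out.println("Unknown selection: " + choice);
            } else {
                action.run();
                executed++;
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        MenuLoopSketch ui = new MenuLoopSketch(System.out);
        ui.register("l", () -> System.out.println("list files (placeholder)"));
        ui.register("u", () -> System.out.println("upload file (placeholder)"));
        // Scripted input stands in for the keyboard so the sketch runs unattended.
        ui.run(new Scanner("l\nz\nu\nx\nl\n"));
    }
}
```

In the real application each registered action would call one of the FileOperations methods instead of printing a placeholder.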

Application Implementation

Enable Google Drive API

To build the application, you will use Eclipse. Before you do that, make sure that you have enabled the service API as described next.

Follow the steps described in Enable Google Service API.
Download the client credentials information in a file (for example, client_secrets.json). Follow the steps described in Create OAuth Client Credentials.

Create the Application Project

In Eclipse, create a Maven project. For more information, see Create a Maven Project.
Add a reference to the authentication app JAR file created in Build GCP Service Client Authentication. Alternatively, and for quickest results, import the downloaded project. For more information, see Import a Maven Project.

Modify the pom.xml File

A key step in creating the application project is to configure the pom.xml file correctly to define the dependencies required to implement the client application. For more information see Define Dependencies in pom.xml.
That’s it. Happy googling with Google Drive.

Build GCP Cloud Storage Client

The post demonstrates how to build a Google Cloud Storage client application in Java. This command line client app shows the basic logic to interact with Google Cloud Storage service and eliminates unnecessary clutter.

The application interacts with Google Cloud Storage via its JSON API using the related Google Java client library.

For more information, go to Google API Client Libraries, then click on the Java link. In the menu bar click APIS, then press Ctrl-F and search for storage. You will get this:

Click on the version link (v1, in the example). This will take you to the Cloud Storage JSON API Client Library for Java. Note that at the bottom of the page, in the section “Add Library to Your Project”, there are several tabs. If you click the Maven tab, you get the dependency in XML format to add to the pom.xml file in your project. This is an example:

  <dependency>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-storage</artifactId>
      <version>v1-rev111-1.22.0</version>
  </dependency>

The app uses a simple UI which allows the user to perform tasks such as: list the buckets in a project, list objects in a bucket, create a bucket, create an object and so on.

You can download the code at: gcp-storage-client. See also Import a Maven Project. Refer to the README file for the latest information about the example code.

For background information, see GCP Cloud Storage Background.

Application Architecture

This section describes the application components and delegates the details to the actual code implementation. The following is the app architecture:

Main. This class is the application entry point. It performs the following tasks:
- Gets the authenticated client object authorized to access the Google Cloud Storage service.
- Reads the default settings.
- Instantiates the operations classes.
- Delegates to the SimpleUI class the display of the selection menu and the processing of the user’s input.
User Interface.
- UserInterface. Abstract class that defines the variables and methods required to implement the SimpleUI class.
- SimpleUI. It extends the UserInterface class and performs the following tasks:
  - Displays a selection menu for the user.
  - Processes the user’s input and calls the proper method based on the user’s selection.
  - Each method calls the related Google Cloud Storage JSON API.
Core Classes.
- ProjectOperations. Contains methods to perform Google Cloud Storage project operations.
- BucketOperations. Contains methods to perform Google Cloud Storage bucket operations.
- ObjectsOperations. Contains methods to perform Google Cloud Storage object operations.
  - ObjectLoaderUtility. Performs object upload. This class is just a container. The actual work is done by the contained classes:
    - RandomDataBlockinputStream. Generates a random data block and repeats it to provide the stream for resumable object upload.
    - CustomUploadProgressListener. Implements a progress listener to be invoked during object upload.
Authentication.
- GoogleServiceClientAuthentication. This is an abstract class which obtains the credentials for the client application to allow the use of the requested Google service REST API.
- IGoogleServiceClientAuthentication. Defines variables and methods to authenticate clients so they can use the selected Google service REST APIs.
- AuthenticateServiceClient. Creates an authenticated client object that is authorized to access the selected Google service API.

For more information, see Create Google Service Authentication App.

Utilities.
- IUtility. Defines fields and methods to implement the Utility class.
- Utility. Defines utility methods and variables to support the application operations such as menu creation, regions list initialization and so on.
- ServiceDefaultSettings. Reads the service client default settings from the related JSON file. The file contains information such as project ID, default e-mail and so on.
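
The idea behind RandomDataBlockinputStream (generate one random block, then repeat it to feed an upload) can be sketched with the standard library alone. The class below is a hypothetical illustration, not the project's implementation; its name, constructor parameters and fixed seed are assumptions.

```java
import java.io.InputStream;
import java.util.Random;

/** Sketch of a stream that repeats one random block; not the project's class. */
final class RepeatingBlockStream extends InputStream {
    private final byte[] block;
    private final long totalBytes;
    private long position = 0;

    /** Creates a stream of totalBytes bytes made of a repeated random block of blockSize bytes. */
    RepeatingBlockStream(int blockSize, long totalBytes, long seed) {
        if (blockSize <= 0 || totalBytes < 0) {
            throw new IllegalArgumentException("blockSize must be > 0 and totalBytes >= 0");
        }
        this.block = new byte[blockSize];
        new Random(seed).nextBytes(block);
        this.totalBytes = totalBytes;
    }

    /** Returns the next byte as 0..255, or -1 once totalBytes have been served. */
    @Override
    public int read() {
        if (position >= totalBytes) {
            return -1;
        }
        int value = block[(int) (position % block.length)] & 0xFF;
        position++;
        return value;
    }

    /** Copies up to length bytes of the repeated block into buffer. */
    @Override
    public int read(byte[] buffer, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        if (position >= totalBytes) {
            return -1;
        }
        int count = (int) Math.min(length, totalBytes - position);
        for (int i = 0; i < count; i++) {
            buffer[offset + i] = block[(int) ((position + i) % block.length)];
        }
        position += count;
        return count;
    }

    public static void main(String[] args) {
        RepeatingBlockStream stream = new RepeatingBlockStream(8, 20, 1L);
        byte[] buffer = new byte[6];
        long total = 0;
        int n;
        while ((n = stream.read(buffer, 0, buffer.length)) != -1) {
            total += n;
        }
        System.out.println("bytes read: " + total);
    }
}
```

Because the data is generated on demand, a stream like this can stand in for a large file during upload tests without using disk space.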

Application Workflow

The following figure shows the application time sequence (or workflow):

The first time the user starts the application, the Main class performs the following tasks:

Reads the default settings.
Creates authenticated storage service client.
Initializes the operation classes.
Initializes the SimpleUI class.
Starts the loop to process user inputs.

The SimpleUI class loops to process the user’s commands until the user terminates the loop. At that point, the application terminates.

Application Implementation

Enable Google Cloud Storage API

To build the application, you will use Eclipse. Before you do that, make sure that you have enabled the service API as described next.

Follow the steps described in Enable Google Service API.
Download the client credentials information in a file (for example, client_secrets.json). Follow the steps described in Create OAuth Client Credentials.

Create the Application Project

In Eclipse, create a Maven project. For more information, see Create a Maven Project.
Add a reference to the authentication app JAR file created in Build GCP Service Client Authentication. Alternatively, and for quickest results, import the downloaded project. For more information, see Import a Maven Project.

Modify the pom.xml File

Build GCP Service Client Authentication

A client application must be authenticated to use any Google Cloud Platform service through its REST API. This is a common and important first step for all the services. This post shows how to create a Java application which encapsulates the necessary authentication logic, so you do not have to recreate it time and time again with the possibility of making mistakes. For simplicity, the example shows how to authenticate command line (aka, native) client applications and authorize their access to Google Cloud Platform services. At this time the app creates authenticated clients for the following services: Google Storage, Google Drive, YouTube, and BigQuery.

This post also contains important background information that you need to know to use Google Cloud service APIs. We suggest you take a look at Background Information before you proceed.

Authentication App Architecture

The Authentication app is a Java application built as a Maven project. With Maven you can define all the up-to-date dependencies by linking to the necessary Google libraries on-line. For more information see GCP Cloud Service Client Apps – Common Tasks.

Find reference information for the Google API libraries at Supported Google APIs (Java). For the latest information, go to the Maven Repository and search for the specific Google library.

The authentication application described in this post has the following architecture:

IGoogleClientAuthentication. Defines variables and methods to authenticate clients so they can use Google service REST APIs.
GoogleServiceClientAuthentication. This is an abstract class which contains the actual logic to obtain the credentials for the client application so it can use the requested Google service REST API. The class uses Google OAuth 2.0 authorization code flow that manages and persists end-user credentials.
AuthenticateGoogleServiceClient. This class extends GoogleServiceClientAuthentication and implements IGoogleClientAuthentication. It creates an authenticated client object that is authorized to access the selected Google service API.
Based on the caller’s selection, it allows the creation of an authorized service to access Google service APIs such as Google Cloud Storage API or Google Drive API.

The class assumes that you already have created a directory to store the file with the client secrets. For example .googleservices/storage. The file containing the secrets is client_secrets.json.

Authentication App Workflow

The following figure shows the example application workflow:

The client application calls the authentication method for the service selected by the user, passing the scope information. The AuthenticateGoogleServiceClient class performs all the steps to create an authenticated client that is authorized to use the Google service REST API. In particular, it performs the following steps:

Reads the client secrets. You must store these secrets in a local file before using the application. You obtain the secrets through the Google developers console by downloading the related JSON information (for native applications) from your service project. The file name used in the example is client_secrets.json; you can use any other name as long as you use the json suffix. For details about the file and directory names, see the code comments.
Uses Google OAuth2 to obtain the authorized service object. The first time you run the application, a browser instance is created to ask you as the project owner to grant access permission to the client. From then on, the credentials are stored in a file named StoredCredential. The name of this file is predefined in the StoredCredential class. This file is stored in the same directory where the client_secrets.json is stored. See the code comments for details. If you delete the StoredCredential file, the resource owner is asked to grant access again.
Google OAuth2 returns the authenticated service object to the AuthenticateGoogleServiceClient which, in turn, returns it to the client application. The client can then use the authenticated object to use the Google service REST API. For example, in case of the Google Storage service, it can list buckets in the project, create buckets, create objects in a bucket, list objects in a bucket and so on.
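
Before running the application, it can help to check which of the two files described above are present. The following sketch uses only the standard library and does not call any Google API; the class name, the State values and the location under the user's home directory are assumptions for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch that reports what the authentication workflow will find in a secrets directory. */
final class CredentialStoreCheck {

    enum State {
        MISSING_SECRETS, // client_secrets.json is not there: authentication cannot start
        NEEDS_CONSENT,   // secrets found, no StoredCredential: the browser consent step will run
        READY            // both files found: stored credentials can be reused
    }

    /** Inspects the directory that should hold client_secrets.json and StoredCredential. */
    static State inspect(Path dir) {
        if (!Files.isRegularFile(dir.resolve("client_secrets.json"))) {
            return State.MISSING_SECRETS;
        }
        if (!Files.isRegularFile(dir.resolve("StoredCredential"))) {
            return State.NEEDS_CONSENT;
        }
        return State.READY;
    }

    public static void main(String[] args) {
        // Assumed location: the example directory name, placed under the user's home directory.
        Path dir = Paths.get(System.getProperty("user.home"), ".googleservices", "storage");
        System.out.println(dir + " -> " + inspect(dir));
    }
}
```

A result of NEEDS_CONSENT corresponds to the first run, or to a run after the StoredCredential file has been deleted.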

Background Information

Enable a Google Service API

In order to use a service API in your application, you must enable it as shown next.

GCP Cloud Service Client Apps – Common Tasks

The following are some common tasks that you must perform when using Google Cloud Service APIs such as enabling a service API, installing an API client library, performing client authentication, and so on.

Prerequisites

Eclipse Version 4.xx. Before installing Eclipse, make sure that you have Java installed (at least the JRE). To download the Java development environment, go to Java SE at a Glance.
Maven plugin installed. Make sure to set your Eclipse preferences as follows:
- Within Eclipse, select Window > Preferences (or on Mac, Eclipse > Preferences).
- Select Maven and select the following options:
  - “Download Artifact Sources”
  - “Download Artifact JavaDoc”

Create a Maven Project

In Eclipse, select File->New->Project. The Select a wizard dialog window is displayed.
Expand the Maven folder and select Maven Project.
Click Next.
In the next dialog window, check Create a simple project (skip archetype selection).
Click Next. The New Maven project dialog is displayed.
Enter the Group Id information, for instance com.clientauth.
Enter the Artifact Id (use the name of the project) for instance ClientAuth.
Accept the Version default 0.0.1-SNAPSHOT. Or assign a new version such as 1.0.
Make sure that the Packaging is jar.
Enter the project name, for example ClientAuthentication.
Click Finish.
This creates a default pom.xml file that you will use to define your application dependencies as shown next.

Define Dependencies in pom.xml

To the default pom.xml, you must add the dependencies specific to your application. The example shown next refers to a console application which uses Google Storage service. To obtain the latest dependencies (aka artifacts) information, perform the following steps:

OAuth2 API Dependency

In your browser, navigate to https://developers.google.com/api-client-library/java/apis/.
In the page, press Ctrl-F and in the box enter oauth2. This will take you to the row containing the OAuth2 library info.
Click on the version link, let’s say v2. This displays the Google OAuth2 API Client Library for Java page.

At the bottom, in the Add Library to Your Project section, click on the Maven tab. This displays the dependencies information similar to the following:

<project>
  <dependencies>
    <dependency>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-oauth2</artifactId>
      <version>v2-rev126-1.22.0</version>
    </dependency>
  </dependencies>
</project>

Copy and paste the <dependency> section in the pom.xml file.
If you want to refer to other versions of the API library, click on the link at the bottom of the page: See all versions available on the Maven Central Repository.

You can define the version in a parametric way as follows:

<version>${project.oauth.version}</version>

where ${project.oauth.version} is defined in the properties section as follows:

<properties>
 <project.http.version>1.22.0</project.http.version>
 <project.oauth.version>v2-rev126-1.22.0</project.oauth.version>
 <project.storage.version>v1-rev105-1.22.0</project.storage.version>
 <project.guava.version>21.0</project.guava.version>
 <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

So the new format is as follows:

<dependency>
  <groupId>com.google.apis</groupId>
  <artifactId>google-api-services-oauth2</artifactId>
  <version>${project.oauth.version}</version>
</dependency>
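
Maven performs the ${...} substitution itself when it reads the pom.xml file. The following sketch only illustrates the effect of that substitution; it is not how Maven is implemented, and the class and method names are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch that shows the effect of replacing ${name} placeholders with property values. */
final class PropertyInterpolationSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every known ${name} in text; unknown placeholders are left untouched. */
    static String resolve(String text, Map<String, String> properties) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = properties.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group(0);
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> properties = new LinkedHashMap<>();
        properties.put("project.oauth.version", "v2-rev126-1.22.0");
        System.out.println(resolve("<version>${project.oauth.version}</version>", properties));
    }
}
```

Keeping the versions in the properties section means that a version change is made in one place, even when several dependencies refer to it.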

Guava Dependency

Guava is a suite of core and expanded libraries that include utility classes, Google’s collections, I/O classes, and much more.

In your browser, navigate to https://mvnrepository.com/.
In the search box, enter the name of the API library google guava.
Click on the tile of the library Guava: Google Core Libraries For Java.
In the displayed page click on the required version.
Click on the Maven tab.
Check the Include comment …. box
Click on the box. This will copy the content to the clipboard.
Paste the content in the pom.xml file.

Managing Dependencies

The Guava library version might conflict with the OAuth2 library version. In order to avoid the conflict we need to add a dependencyManagement section to the pom.xml file. Follow these steps:

In Eclipse, in the pom.xml editor, click on the Dependencies tab.
Click on the Manage button.
In the left pane, select the Guava and OAuth libraries.

Click the Add button. This creates the dependencyManagement section. The following shows an example:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.apis</groupId>
      <artifactId>google-api-services-oauth2</artifactId>
      <version>${project.oauth.version}</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>${project.guava.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>

HTTP Dependency

This library is needed to allow a Java application to make asynchronous HTTP requests over the network through the REST API of the cloud service it uses, for example, Google Storage.

In your browser, navigate to https://mvnrepository.com/.
In the search box, enter the name of the library google http client.
Click on the tile of the library Google HTTP Client Library For Java.
In the displayed page click on the required version.
Click on the Maven tab.
Check the Include comment …. box
Click on the box. This will copy the content to the clipboard.
Paste the content in the pom.xml file.

Jackson Extensions to HTTP Library Dependency

This library is needed to allow a Java application to perform XML and JSON parsing.

In your browser, navigate to https://mvnrepository.com/.
In the search box, enter the name of the library google http client.
Click on the tile of the library Jackson 2 Extensions To The Google HTTP Client Library For Java.
In the displayed page click on the required version (the same you used for the HTTP library).
Click on the Maven tab.
Check the Include comment …. box
Click on the box. This will copy the content to the clipboard.
Paste the content in the pom.xml file.

Google Storage API Dependency

In your browser, navigate to https://developers.google.com/api-client-library/java/apis/.
In the page, press Ctrl-F and in the box enter cloud storage. This will take you to the row containing the Cloud Storage library info.
Click on the version link, let’s say v1. This displays the Cloud Storage JSON API Client Library for Java page.
At the bottom, in the Add Library to Your Project section, click on the Maven tab.
Copy and paste the dependency section in the pom.xml file.

Once you have updated the pom.xml file, make sure to update the project by right-clicking on the project name and then selecting Maven->Update Project…
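
The version strings used in this section, such as v1-rev105-1.22.0, appear to combine the API version, a revision, and the client library version. Assuming that reading is correct, the following sketch checks that a Google API artifact and the HTTP client library refer to the same client library version. The class and method names are hypothetical.

```java
/** Sketch that compares version strings, assuming the form <api>-rev<N>-<client library version>. */
final class ApiVersionSketch {

    /** Returns the text after the last '-', taken here to be the client library version. */
    static String clientLibraryVersion(String artifactVersion) {
        int dash = artifactVersion.lastIndexOf('-');
        return (dash < 0) ? artifactVersion : artifactVersion.substring(dash + 1);
    }

    /** True when the artifact's trailing version equals the HTTP client library version. */
    static boolean matchesHttpClient(String artifactVersion, String httpClientVersion) {
        return clientLibraryVersion(artifactVersion).equals(httpClientVersion);
    }

    public static void main(String[] args) {
        String httpVersion = "1.22.0";
        for (String version : new String[] {"v2-rev126-1.22.0", "v1-rev105-1.22.0"}) {
            System.out.println(version + " matches " + httpVersion + ": "
                    + matchesHttpClient(version, httpVersion));
        }
    }
}
```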

Import a Maven Project

Download the archived project from the specified location.
Unzip the downloaded archive.
In Eclipse, create a workspace or use an existing one.
Click OK.
Click File->Import.
In the wizard window, select Maven->Existing Maven Projects.
Click Next.
Click the Browse… button and navigate to the location containing the unzipped code archive. The following is an example of a project to import:
Click OK. You get a window similar to this:
Click Finish.

What Can Go Wrong?

Local JARs

You may have local JARs that must be added to the project path. If they are not included, errors are reported in the Problems window.

To solve this kind of problem, perform the following steps:

In Eclipse, in the Package Explorer, right click on the project name.
Navigate to Properties->Java Build Path.
Click on the Libraries tab.
Click the Add JARs… button.
Select your local JAR, from the lib folder for example, and click OK.
You will get a window similar to the following:
Click OK.
The error should disappear from the list in the Problems window.

Execution Environment

You could get a warning about the execution environment similar to the following:

To solve this kind of problem, perform the following steps:

In Eclipse, in the Package Explorer, right click on the project name.
Navigate to Properties->Java Build Path.
Click on the Libraries tab.
Select the current JRE System Library.
Click the Remove button.
Click the Add Library… button.
Select the JRE System Library.
Click Next.
Click Finish. The new JRE System Library version should be listed.
Click OK.
The warning should disappear from the list in the Problems window.

Compiler Version

You could get an error about the compiler version similar to the following:

To solve this kind of problem, perform the following steps:

In Eclipse, in the Package Explorer, right click on the project name.
Navigate to Properties->Java Compiler.
In the right pane, uncheck Enable project specific settings.
Click the link Configure Workspace Settings….
In the next window, select version 1.8 or higher.
Check Use default compliance settings.
Click OK.
Click OK.
Click Yes, in the popup asking to recompile the project.
The error should disappear from the list in the Problems window.

Create Runnable JAR

In Eclipse, in the Package Explorer, right click on the project name.
Click Export.
Expand the Java folder.
Click Runnable JAR file.
Click Next.
In the Launch configuration, select the one applicable to the project.
This is the configuration you define to run the application in Eclipse.
In the Export destination enter or browse to the location where to store the JAR and enter the name for the JAR file.
Click Finish.
To execute the application, open a terminal window.
Change to the directory where the JAR is located.
Enter a command similar to the following:
```
  java -jar google-drive-client.jar
```

AWS Elastic Map Reduce Quick Start – Dashboard

This post provides essential instructions on how to get started with Amazon Elastic MapReduce (Amazon EMR). You will learn how to create a sample Amazon EMR cluster by using the AWS Management Console. You then run a Hive script to process data stored in Amazon S3.

The instructions in this example do not apply to production environments and they do not cover in depth configuration options. The example shows how to quickly set up a cluster for evaluation purposes. For questions or issues you can reach out to the Amazon EMR team by posting on the Discussion Forum.

Cost

The sample cluster that you create runs in a live environment and you are charged for the resources used. This example should take an hour or less, so the charges should be minimal. After you complete this example, you should reset your environment to avoid incurring further charges. For more information, see Reset EMR Environment.

Pricing for Amazon EMR varies by region and service. For this example, charges accrue for the Amazon EMR cluster and Amazon Simple Storage Service (Amazon S3) storage of the log data and output from the Hive job. If you are within your first year of using AWS, some or all of your charges for Amazon S3 might be waived if you are within your usage limits of the AWS Free Tier.
For more information about Amazon EMR pricing and the AWS Free Tier, go to Amazon EMR Pricing and AWS Free Tier.

You can use the Amazon Web Services Simple Monthly Calculator to estimate your bill.

Sample EMR Cluster Prerequisites

The following are the preliminary steps you must perform to complete the example.

Create an AWS account.
Create an S3 bucket.
The example in this topic uses an S3 bucket to store log files and output data.
Due to Hadoop constraints, the bucket name should conform to these requirements:
- It must contain only lowercase letters, numbers, periods, and hyphens.
- It cannot end with a number.
  Example: mycompany.username.vernumber-emr-quickstart.
Click on the S3 bucket name. The bucket page is displayed.
Create two folders named logs and output.
Make sure that the output folder is empty. For more information, see Creating a Folder.
Create an Amazon EC2 Key Pair.
You need the key pair to connect to the nodes in the cluster.
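
The two bucket name requirements listed above can be checked before creating the bucket. The sketch below covers only those two rules; Amazon S3 has naming rules of its own that are not checked here. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

/** Sketch that checks a bucket name against the two requirements listed in this post. */
final class EmrBucketNameCheck {
    // Only lowercase letters, numbers, periods, and hyphens are allowed.
    private static final Pattern ALLOWED = Pattern.compile("[a-z0-9.-]+");

    /** True when the name uses only allowed characters and does not end with a number. */
    static boolean isValid(String name) {
        if (name == null || !ALLOWED.matcher(name).matches()) {
            return false;
        }
        char last = name.charAt(name.length() - 1);
        return !Character.isDigit(last);
    }

    public static void main(String[] args) {
        String[] candidates = {
            "mycompany.username.vernumber-emr-quickstart",
            "MyCompany-emr",
            "mycompany-emr-2"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + isValid(name));
        }
    }
}
```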

Launch the Sample Amazon EMR Cluster

In your browser, navigate to the Amazon management console.
In the Analytics section click on EMR. The console dashboard is displayed.
Click the Create cluster button.
The Create Cluster – Quick Options page is displayed.
For more information, see Using Quick Cluster Configuration Overview.
Accept the default values except for the following fields:
- In the Cluster name box, enter any name that has meaning to you.
- For the S3 folder box, click on the folder icon to select the path to the logs folder that you created.
- For the EC2 key pair box, from the drop-down list, choose the key pair that you created.
Click the Create cluster button.

AWS Elastic Map Reduce (EMR)

Amazon Elastic MapReduce (Amazon EMR) is a web service that makes it easy to quickly and cost-effectively process vast amounts of data. Amazon EMR simplifies big data processing, providing a managed Hadoop framework that makes it easy, fast, and cost-effective for you to distribute and process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as Apache Spark and Presto (SQL Query Engine) in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. For a quick overview, see Introduction to Amazon Elastic MapReduce.

Background

Amazon EMR enables you to quickly and easily provision as much computing capability as you need and to add, reduce, or remove it at any time. This is very important when dealing with variable or unpredictable processing requirements, as is often the case with big data processing.
For example, if the bulk of your processing occurs at night, you might need 100 virtual machine instances during the day and 500 instances at night. Or you might need a significant computing peak for a short period of time. With Amazon EMR you can quickly provision hundreds or thousands of instances, and release them when the work is completed, saving on the overall cost.

Computing Capacity

The following are some possible ways to control computing capacity:

Deploy Multiple Clusters. If you need more capacity, you can easily launch a new cluster and terminate it when you no longer need it. There is no limit to how many clusters you can have. Use multiple clusters to accommodate multiple users or applications. For example, you can store your input data in Amazon S3 and launch one cluster for each application that needs to process the data. One cluster might be optimized for CPU, a second cluster might be optimized for storage, etc.
Resize Running Cluster. With Amazon EMR it is easy to resize a running cluster. You may want to resize a cluster to temporarily add more processing power to the cluster, or shrink it to save on costs. You could add hundreds of instances to the cluster when batch processing occurs, and remove the extra instances when the processing is completed. When adding instances to your cluster, EMR can start utilizing provisioned capacity as soon as it becomes available, and you can shrink the cluster with minimum impact to running jobs. For more information, see resize running cluster.

Cost

Amazon EMR is designed to reduce the cost of processing large amounts of data. The following are some of the features to control operational costs:

Low Hourly Price. Pricing is per instance hour and starts at $.015 per instance hour for a small instance ($131.40 per year). For more information, see Amazon EMR Pricing.
Amazon EC2 Reserved Instance Integration. Amazon EC2 reserved Instances enable you to maintain the benefits of elastic computing while lowering costs and reserving capacity. With reserved Instances you pay a low, one-time fee and in turn receive a significant discount on the hourly charge for that instance. Amazon EMR makes it easy to utilize reserved Instances so you can save up to 65% off the on-demand price. For more information, see Ensure Capacity with Reserved Instances.
Elasticity. Amazon EMR makes it easy to add and remove computing power, so you do not need to provision more capacity than required. You may not know how much data your clusters must handle in six months, or you may have peaks in processing needs. You can easily add or remove capacity at any time.
S3 Integration. EMR clusters efficiently and securely use Amazon S3 as an object store for Hadoop through its EMR File System (EMRFS). You can store your data in Amazon S3 and use multiple Amazon EMR clusters to process the same data set. Each cluster can be optimized for a particular workload, which can be more efficient than a single cluster serving multiple workloads with different requirements. For example, you might have one cluster that is optimized for I/O and another that is optimized for CPU, each processing the same data set in Amazon S3. In addition, by storing your input and output data in Amazon S3, you can shut down clusters when they are no longer needed.
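
The yearly figure quoted under Low Hourly Price can be checked with a few lines of arithmetic. The sketch below multiplies the hourly price by the number of hours in a 365-day year. The class name EmrPriceCheck is hypothetical, and the price is simply the one quoted above, which may no longer be current.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class EmrPriceCheck {

    /** Cost of running one instance continuously for a 365-day year at the given hourly price. */
    public static BigDecimal yearlyCost(BigDecimal pricePerInstanceHour) {
        BigDecimal hoursPerYear = BigDecimal.valueOf(24L * 365L); // 8760 hours
        return pricePerInstanceHour.multiply(hoursPerYear).setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // 0.015 is the hourly price quoted in the text, not a current price list entry.
        System.out.println("Yearly cost of one small instance: $" + yearlyCost(new BigDecimal("0.015")));
    }
}
```

BigDecimal is used instead of double so that the result is exact to the cent.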

Data Stores

Amazon EMR allows you to use different types of data stores as described next.

Amazon S3. Through the EMR File System (EMRFS), Amazon EMR can efficiently and securely use Amazon S3 as an object store for Hadoop. When you launch your cluster, Amazon EMR streams the data from Amazon S3 to each instance in your cluster and begins processing it immediately. One advantage of storing your data in Amazon S3 and processing it with Amazon EMR is that you can use multiple clusters to process the same data. For example, you might have a Hive development cluster that is optimized for memory and a Pig production cluster that is optimized for CPU, both using the same input data set.
Hadoop Distributed File System (HDFS). This is the Hadoop file system. In Amazon EMR, HDFS uses local ephemeral storage. Depending on the instance type, this could be spinning disks or solid state drives. Every instance in your cluster has local ephemeral storage, but you decide which instances run HDFS. Amazon EMR refers to instances running HDFS as core nodes and instances not running HDFS as task nodes. For more information, see Hadoop Distributed File System.
Amazon DynamoDB. Amazon EMR has direct integration with Amazon DynamoDB, which is a fast, fully managed NoSQL database service. You can quickly and efficiently process data stored in Amazon DynamoDB and transfer data between Amazon DynamoDB, Amazon S3, and HDFS in Amazon EMR. For more information, see Amazon DynamoDB.

Hadoop Tools

Amazon EMR supports proven Hadoop tools such as Hive, Pig, HBase, and Impala. Additionally, it can run distributed computing frameworks besides Hadoop MapReduce such as Spark or Presto using bootstrap actions. You can also use Hue and Zeppelin as GUIs for interacting with applications on your cluster.

Hive. Open source data warehouse and analytics package that runs on top of Hadoop. Hive is operated by Hive QL, a SQL-based language which allows users to structure, summarize, and query data. Hive QL goes beyond standard SQL by adding first-class support for map/reduce functions and complex extensible user-defined data types like JSON and Thrift. This capability allows processing of complex and unstructured data sources such as text documents and log files. Hive allows user extensions via user-defined functions written in Java.
Amazon EMR has made numerous improvements to Hive, including direct integration with Amazon DynamoDB and Amazon S3. For example, you can load table partitions automatically from Amazon S3, you can write data to tables in Amazon S3 without using temporary files, and you can access resources in Amazon S3 such as scripts for custom map/reduce operations and additional libraries. For more information, see Apache Hive.
Pig. Open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language which allows users to structure, summarize, and query data. Pig Latin also adds first-class support for map/reduce functions and complex extensible user-defined data types. This capability allows processing of complex and unstructured data sources such as text documents and log files. Pig allows user extensions via user-defined functions written in Java.
Amazon EMR has made numerous improvements to Pig, including the ability to use multiple file systems (normally Pig can only access one remote file system), the ability to load customer JARs and scripts from Amazon S3 and additional functionality for String and DateTime processing. For more information, see Apache Pig.
HBase. Open source, non-relational, distributed database modeled after Google’s BigTable. It runs on top of Hadoop Distributed File System (HDFS) to provide BigTable-like capabilities for Hadoop. HBase provides a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because data is stored in-memory instead of on disk. HBase is optimized for sequential write operations, and it is highly efficient for batch inserts, updates, and deletes. HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).
In Amazon EMR, you can back up HBase to Amazon S3 (full or incremental, manual or automated) and you can restore from a previously created backup. For more information, see Apache HBase.
Impala. Open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS). With this architecture, you can query your data in HDFS or HBase tables very quickly, and leverage Hadoop’s ability to process diverse data types and provide schema at runtime. This allows for interactive, low-latency analytics. It also supports user-defined functions in Java and C++, and can connect to BI tools through ODBC and JDBC drivers. Impala uses the Hive metastore to hold information about the input data, including the partition names and data types. For more information, see Impala.
Hue. Open source user interface for Hadoop that makes it easier to run and develop Hive queries, manage files in HDFS, run and develop Pig scripts, and manage tables.
In Amazon EMR, Hue is integrated with Amazon S3, so you can query directly against S3 and easily transfer files between HDFS and Amazon S3. For more information, see Hue.
Spark. Hadoop engine for fast processing of large data sets. It uses in-memory, fault-tolerant resilient distributed data sets (RDDs) and directed, acyclic graphs (DAGs) to define data transformations. Spark also includes Spark SQL, Spark Streaming, MLlib, and GraphX. For more information, see Apache Spark on Amazon EMR.
Presto. Open-source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. It supports the ANSI SQL standard, including complex queries, aggregations, joins, and window functions. Presto can process data from multiple data sources including the Hadoop Distributed File System (HDFS) and Amazon S3. For more information, see Presto on Amazon EMR.
Zeppelin. Open source GUI which creates interactive and collaborative notebooks for data exploration using Spark. You can use Scala, Python, SQL (using Spark SQL), or HiveQL to manipulate data and quickly visualize results. Zeppelin notebooks can be shared among several users, and visualizations can be published to external dashboards. For more information, see Amazon EMR Sandbox Applications.
Oozie. Workflow scheduler for Hadoop, where you can create Directed Acyclic Graphs (DAGs) of actions. Also, you can easily trigger your Hadoop workflows by actions or time. For more information, see Amazon EMR Sandbox Applications.
Other. EMR supports a variety of other popular applications and tools, such as R, Mahout (machine learning), Ganglia (monitoring), Accumulo (secure NoSQL database), Sqoop (relational database connector), HCatalog (table and storage management), and more.
The Amazon EMR team maintains an open source repository of bootstrap actions on GitHub that you can use to install additional software, configure your cluster, or serve as examples for writing your own bootstrap actions.

References

Build AWS EC2 Client

This topic shows how to create a Java console application to interact with Amazon EC2 by using the AWS Java SDK. For more information, see Using the AWS SDK for Java. This is a command line client that eliminates unnecessary clutter and shows the basic logic to interact with Amazon EC2. Hopefully this will help you to understand the syntax (and semantics) of the API.

A separate project handles the creation of an EC2 authenticated client which is allowed to access the EC2 service REST API.

You can download the code at: aws-ec2-client and the related documentation at: aws-ec2-client-docs. Please refer to the README file for the latest example code information. See also Import a Maven Project. You must also download the companion project at aws-auth-client and include it in the client app project. You can download the related documentation at: aws-auth-client-docs.

Application Internals

The following figure shows the application event trace:
aws ec2 client
A simple UI allows the user to perform tasks such as: create EC2 instances, list instances, assign instance attributes and so on.
The first time the user starts the application, the Main class performs the following tasks:

Creates an authorized EC2 client
Initializes the EC2 operations class
Initializes the SimpleUI class
Starts the loop to process the user’s input

The SimpleUI class loops to process the user’s input until the loop is exited. At that point, the application terminates.
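
The input loop described above can be sketched as follows. This is not the project's SimpleUI class; the class name MenuLoopSketch, the one-letter commands, and the placeholder messages are assumptions for illustration, and no AWS call is made.

```java
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.util.Scanner;

class MenuLoopSketch {

    /** Reads one command per line and dispatches it until the user enters "x" or the input ends. */
    public static int run(Readable input, PrintStream out) {
        Scanner scanner = new Scanner(input);
        int handled = 0; // number of recognized commands that were dispatched
        while (scanner.hasNextLine()) {
            String command = scanner.nextLine().trim().toLowerCase();
            if (command.equals("x")) {
                out.println("Bye.");
                break;
            }
            switch (command) {
                case "l":
                    out.println("(an operations class would list instances here)");
                    handled++;
                    break;
                case "c":
                    out.println("(an operations class would create an instance here)");
                    handled++;
                    break;
                default:
                    out.println("Unknown command: " + command);
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        System.out.println("Commands: l = list, c = create, x = exit");
        run(new InputStreamReader(System.in), System.out);
    }
}
```

Passing the input and output streams as parameters keeps the loop testable without a console.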

Modify the pom.xml File

A key step in creating the application project is to configure the pom.xml file correctly to define the dependencies required to implement the client application. You can find the file at: pom.xml.

Application Components

This section describes the components of the application and delegates the details to the actual code implementation.

Main. Instantiates the authenticated EC2 service client, initializes the operations and the UI classes.
SimpleUI. Displays the menu of choices for the user. It processes the user’s input and calls the proper function based on the user’s selection. Each function calls the related AWS EC2 Java library method (which in turn calls the related REST API).
UserInterface. Defines the attributes and methods required to implement the SimpleUI class.
AwsClientAuthentication. Creates an authenticated client which is allowed to use the EC2 API.
IEC2Client. Defines fields and methods to implement the Ec2ClientAuthentication class.
IUtility. Defines fields and methods to implement the Utility class.
Utility. Defines utility methods and variables to support the application operations such as menu creation, regions list initialization and so on.

EC2Operations. Performs EC2 operations selected by the user. The various methods call the related EC2 library functions that in turn call the REST APIs which interact with the EC2 service. The following example code shows how to get available instances associated with a specific key pair.

public static void getInstancesInformation(String keyName) {
        List<Instance> resultList = new ArrayList<Instance>();
        // Describe the instances visible to the authenticated client.
        DescribeInstancesResult describeInstancesResult = ec2Client.describeInstances();
        List<Reservation> reservations = describeInstancesResult.getReservations();
        for (Reservation reservation : reservations) {
            for (Instance instance : reservation.getInstances()) {
                // An instance launched without a key pair has a null key name,
                // so call equals on the argument rather than on the instance value.
                if (keyName != null && keyName.equals(instance.getKeyName()))
                    resultList.add(instance);
            }
        }
        displayInstancesInformation(resultList);
    }

Security Access Credentials

You need to set up your AWS security credentials before the sample code is able to connect to AWS. You can do this by creating a file named “credentials” in the ~/.aws/ directory (C:\Users\USER_NAME\.aws\ for Windows users) and saving the following lines in the file:

[default]
    aws_access_key_id = <your access key id>
    aws_secret_access_key = <your secret key>

For more information, see Providing AWS Credentials in the AWS SDK for Java.
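
Before running the sample code, it can help to confirm that the credentials file exists and that the [default] profile contains both keys. The following is a minimal sketch of such a check. It is a hand-written reader for the simple layout shown above, not the parser used by the AWS SDK, and the class name CredentialsFileCheck is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CredentialsFileCheck {

    /** Location of the credentials file under the current user's home directory. */
    public static Path defaultCredentialsPath() {
        return Paths.get(System.getProperty("user.home"), ".aws", "credentials");
    }

    /** Collects the key = value pairs found under the given [profile] header. */
    public static Map<String, String> readProfile(List<String> lines, String profile) {
        Map<String, String> values = new HashMap<>();
        boolean inProfile = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                inProfile = line.substring(1, line.length() - 1).trim().equals(profile);
            } else if (inProfile) {
                int eq = line.indexOf('=');
                if (eq > 0) {
                    values.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
                }
            }
        }
        return values;
    }

    /** True when the profile defines both keys shown in the example file. */
    public static boolean hasBothKeys(Map<String, String> profile) {
        return profile.containsKey("aws_access_key_id")
                && profile.containsKey("aws_secret_access_key");
    }

    public static void main(String[] args) throws IOException {
        Path path = defaultCredentialsPath();
        if (!Files.isReadable(path)) {
            System.out.println("No readable credentials file at " + path);
            return;
        }
        Map<String, String> profile = readProfile(Files.readAllLines(path), "default");
        // Report only whether the keys are present; never print the secret values.
        System.out.println("[default] profile complete: " + hasBothKeys(profile));
    }
}
```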

References

Build AWS S3 Client

This post shows how to create a Java console application to interact with Amazon S3 by using AWS Java SDK. For more information, see Using the AWS SDK for Java. A simple UI allows the user to perform tasks such as list the buckets in the account, list objects in a bucket, create a bucket, create an object and so on.

This is a command line client that eliminates unnecessary clutter and shows the basic logic to interact with Amazon S3. Hopefully, this will help you to understand the syntax (and semantics) of the API.

A separate project handles the creation of an S3 authenticated client which is allowed to access the S3 service REST API.

You can download the code here: aws-s3-client. Please refer to the README file for the latest example code information. You must also download the companion project at aws-auth-client and include it in the client app project.

Prerequisites

📝 You must have Maven installed. The dependencies are satisfied by building the Maven package.
🚨 Also, make sure to download the [aws-client-auth](https://github.com/milexm/aws-client-auth) project and include it in this client app project.
📝 If you use Eclipse to build the application (why not?), follow the steps described at: GCP Cloud Service Client Apps – Common Tasks.

Application Internals

Application Class Diagram

The following is the application class diagram.

Application Workflow

The following figure shows the application event trace.

aws s3 client event trace

The first time the user starts the application, the Main class performs the following actions:

Creates an authorized S3 client
Initializes the operations classes
Initializes the SimpleUI class
Starts the loop to process user inputs

The SimpleUI class loops to process the user’s commands until the loop is exited. At that point, the application terminates.

Modify the pom.xml File

Application Components

This section describes the components of the application and delegates the details to the actual code implementation.

Main. Gets authorization to access the S3 service, initializes the command classes. Delegates to the SimpleUI class the display of the selection menu and the processing of the user’s input.
SimpleUI. Displays the menu of choices for the user. It processes the user’s input and calls the proper function based on the user’s selection. Each function calls the related AWS S3 Java library method (which in turn calls the related REST API).
AwsClientAuthentication. Creates Amazon S3 authenticated client.

BucketOperations. Contains methods to perform S3 Bucket operations. The following code shows how to create a bucket, for example.

public static void CreateBucket(String bucketName) throws IOException {

    try {
        System.out.println("Creating bucket " + bucketName + "\n");
        // Create the bucket.
        s3Client.createBucket(bucketName);
    }
    catch (AmazonServiceException ase) {
        // The request reached Amazon S3 but was rejected; collect the details.
        StringBuilder err = new StringBuilder();
        err.append("Caught an AmazonServiceException, which means your request made it "
                + "to Amazon S3, but was rejected with an error response for some reason.");
        err.append(String.format("%n Error Message:  %s %n", ase.getMessage()));
        err.append(String.format(" HTTP Status Code: %s %n", ase.getStatusCode()));
        err.append(String.format(" AWS Error Code: %s %n", ase.getErrorCode()));
        err.append(String.format(" Error Type: %s %n", ase.getErrorType()));
        err.append(String.format(" Request ID: %s %n", ase.getRequestId()));
        // Print the collected details; otherwise the error is silently dropped.
        System.out.println(err);
    }
    catch (AmazonClientException ace) {
        // The client could not reach Amazon S3 or could not parse the response.
        System.out.println("Caught an AmazonClientException, which means the client encountered "
                + "a serious internal problem while trying to communicate with S3, "
                + "such as not being able to access the network.");
        System.out.println("Error Message: " + ace.getMessage());
    }
}

ObjectOperations. Contains methods to perform S3 Object operations. The following code shows how to list objects in a bucket, for example.

    public static void listObject(String bucketName) throws IOException {

        try {
            System.out.println("Listing objects");

            // List only the objects whose key starts with "m".
            ListObjectsRequest listObjectsRequest = new ListObjectsRequest()
                    .withBucketName(bucketName)
                    .withPrefix("m");
            ObjectListing objectListing;
            do {
                // Each call returns one page of results.
                objectListing = s3Client.listObjects(listObjectsRequest);
                for (S3ObjectSummary objectSummary :
                        objectListing.getObjectSummaries()) {
                    System.out.println(" - " + objectSummary.getKey() + "  " +
                            "(size = " + objectSummary.getSize() +
                            ")");
                }
                // Continue from where the previous page ended.
                listObjectsRequest.setMarker(objectListing.getNextMarker());
            } while (objectListing.isTruncated());
        }
        catch (AmazonServiceException ase) {
            // The request reached Amazon S3 but was rejected; collect the details.
            StringBuilder err = new StringBuilder();

            err.append("Caught an AmazonServiceException, which means your request made it "
                    + "to Amazon S3, but was rejected with an error response for some reason.");
            err.append(String.format("%n Error Message:  %s %n", ase.getMessage()));
            err.append(String.format(" HTTP Status Code: %s %n", ase.getStatusCode()));
            err.append(String.format(" AWS Error Code: %s %n", ase.getErrorCode()));
            err.append(String.format(" Error Type: %s %n", ase.getErrorType()));
            err.append(String.format(" Request ID: %s %n", ase.getRequestId()));
            // Print the collected details; otherwise the error is silently dropped.
            System.out.println(err);
        }
        catch (AmazonClientException ace) {
            // The client could not reach Amazon S3 or could not parse the response.
            System.out.println("Caught an AmazonClientException, which means the client encountered "
                    + "a serious internal problem while trying to communicate with S3, "
                    + "such as not being able to access the network.");
            System.out.println("Error Message: " + ace.getMessage());
        }
    }
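
The do/while loop in listObject follows a marker-based paging pattern: request a page, process it, pass the returned marker to the next request, and stop when the listing is no longer truncated. The following sketch shows the same pattern with an in-memory stand-in for the service so it can run without AWS. The Page class and the listPage method are hypothetical and only mimic the shape of the listing calls used above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MarkerPagingSketch {

    /** One page of keys plus the marker for the next page (null when nothing is left). */
    static final class Page {
        final List<String> keys;
        final String nextMarker;

        Page(List<String> keys, String nextMarker) {
            this.keys = keys;
            this.nextMarker = nextMarker;
        }

        boolean isTruncated() {
            return nextMarker != null;
        }
    }

    /** Stand-in for the service: returns up to pageSize keys that sort after the marker. */
    static Page listPage(List<String> sortedKeys, String marker, int pageSize) {
        if (pageSize < 1) {
            throw new IllegalArgumentException("pageSize must be at least 1");
        }
        List<String> page = new ArrayList<>();
        for (String key : sortedKeys) {
            if (marker != null && key.compareTo(marker) <= 0) {
                continue; // already returned on an earlier page
            }
            if (page.size() == pageSize) {
                // More keys remain, so the last key returned becomes the next marker.
                return new Page(page, page.get(page.size() - 1));
            }
            page.add(key);
        }
        return new Page(page, null);
    }

    /** Walks every page, feeding each returned marker into the next request. */
    public static List<String> listAll(List<String> sortedKeys, int pageSize) {
        List<String> all = new ArrayList<>();
        String marker = null;
        Page page;
        do {
            page = listPage(sortedKeys, marker, pageSize);
            all.addAll(page.keys);
            marker = page.nextMarker;
        } while (page.isTruncated());
        return all;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("m1.txt", "m2.txt", "m3.txt", "m4.txt", "m5.txt");
        System.out.println(listAll(keys, 2));
    }
}
```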

Security Access Credentials

🚨 You need to set up your AWS security credentials before the sample code is able to connect to AWS. You can do this by creating a file named “credentials” in the ~/.aws/ directory on Mac (C:\Users\USER_NAME\.aws\ on Windows) and saving the following lines in the file:

[default]
    aws_access_key_id = <your access key id>
    aws_secret_access_key = <your secret key>

For information on how to create security credentials, see Create Access Credentials. See also Providing AWS Credentials in the AWS SDK for Java.

References

Getting Started with the AWS SDK for Java
Providing AWS Credentials in the AWS SDK for Java
Amazon S3 Documentation
Working with Amazon S3 Buckets
Working with Amazon S3 Objects
AWS Toolkit for Eclipse
Java Development Blog

GCP Cloud Storage Background

Google Cloud Storage (GCS) is an Infrastructure as a Service (IaaS) offering for storing and
accessing customer data. The service combines the performance and scalability of Google’s cloud
with advanced security and sharing capabilities.
GCS provides a simple programming interface through standard HTTP methods PUT, GET,
POST, HEAD, and DELETE to store, share and manage data in the cloud. In this way, you
don’t have to rely on complicated SOAP toolkits or RPC programming.
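
To illustrate the plain HTTP interface, the following sketch builds (but does not send) GET, PUT, and DELETE requests for an object with the JDK's java.net.http API (Java 11 or later). The endpoint form https://storage.googleapis.com/BUCKET/OBJECT and the bearer-token header are assumptions about how the service is addressed; the class name GcsRequestSketch and the bucket, object, and token values are placeholders. Names are assumed to be URL-safe already.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class GcsRequestSketch {

    // Assumed endpoint form; check the service documentation before relying on it.
    private static final String ENDPOINT = "https://storage.googleapis.com";

    private static HttpRequest.Builder base(String bucket, String object, String accessToken) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT + "/" + bucket + "/" + object))
                .header("Authorization", "Bearer " + accessToken);
    }

    /** GET retrieves the object data. */
    public static HttpRequest download(String bucket, String object, String accessToken) {
        return base(bucket, object, accessToken).GET().build();
    }

    /** PUT uploads (or overwrites) the object with the given text. */
    public static HttpRequest upload(String bucket, String object, String accessToken, String text) {
        return base(bucket, object, accessToken)
                .header("Content-Type", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofString(text))
                .build();
    }

    /** DELETE removes the object. */
    public static HttpRequest delete(String bucket, String object, String accessToken) {
        return base(bucket, object, accessToken).DELETE().build();
    }

    public static void main(String[] args) {
        HttpRequest request = upload("my-bucket", "notes.txt", "PLACEHOLDER_TOKEN", "hello");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

A request built this way would be sent with java.net.http.HttpClient; that step is left out because it needs a real bucket and a valid access token.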

Google Cloud Storage Architecture

Let’s analyze the main GCS architectural components, as shown in the next figure, to gain
an understanding of the inner workings and capabilities of GCS.

To use Google Cloud Storage effectively you need to understand some of the concepts on
which it is built. These concepts define how your data is stored in Google Cloud Storage.

Projects

All data in Google Cloud Storage belongs inside a project. A project consists of a set
of users, a set of APIs, and billing, authentication, and monitoring settings for those
APIs. You can have one project or multiple projects.

Buckets

Buckets are the basic containers that hold your data. Everything that you store in
Google Cloud Storage must be contained in a bucket. You can use buckets to
organize your data and control access to your data, but unlike directories and
folders, you cannot nest buckets.

Bucket names. Bucket names must be unique across the entire Google Cloud Storage and have more restrictions than object names because every bucket resides in a single Google Cloud Storage namespace. Also, bucket names can be used with a CNAME redirect, which means they need to conform to DNS naming conventions. For more information, see Bucket and Object Naming Guidelines.
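
As a rough illustration of what conforming to DNS naming conventions means, the sketch below checks a candidate name against the rules for a single DNS label: 3 to 63 characters, lowercase letters, digits, and hyphens only, starting and ending with a letter or digit. This is a simplified assumption, not the complete Bucket and Object Naming Guidelines, and the class name BucketNameCheck is hypothetical.

```java
import java.util.regex.Pattern;

class BucketNameCheck {

    // One DNS-style label: 3 to 63 characters, alphanumeric at both ends, hyphens allowed inside.
    private static final Pattern DNS_LABEL = Pattern.compile("[a-z0-9][a-z0-9-]{1,61}[a-z0-9]");

    /** True when the name passes the simplified DNS-label check described above. */
    public static boolean looksLikeDnsLabel(String name) {
        return name != null && DNS_LABEL.matcher(name).matches();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"my-bucket-01", "My_Bucket", "ab", "-abc"}) {
            System.out.println(name + " -> " + looksLikeDnsLabel(name));
        }
    }
}
```

A name that passes this check can still be rejected by the service, for example because another project already owns it.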

Objects

Objects are the individual pieces of data that you store in Google Cloud Storage. Objects have two components: object data and object metadata. The object data component is usually a file that you want to store in Google Cloud Storage. The object metadata component is a collection of name-value pairs that describe various object qualities.

Object names. An object name is just metadata to Google Cloud Storage. The following are the main properties:
- An object name can contain any combination of Unicode characters (UTF-8 encoded) less than 1024 bytes in length.
- An object name must be unique within a given bucket.
- A common character to include in file names is a slash (/). By using slashes in an object name, you can make objects appear as though they’re stored in a hierarchical structure. For example, you could name an object /europe/france/paris.jpg and another object /europe/france/cannes.jpg. When you list these objects they appear to be in a hierarchical directory structure based on location; however, Google Cloud Storage sees the objects as independent objects with no hierarchical relationship whatsoever.
Object Immutability. Objects are immutable, which means that an uploaded object cannot change throughout its storage lifetime. An object’s storage lifetime is the time between successful object creation (upload) and successful object deletion. In practice, this means that you cannot make incremental changes to objects, such as append operations or truncate operations. However, it is possible to overwrite objects that are stored in Google Cloud Storage because an overwrite operation is in effect a delete object operation followed immediately by an upload object operation. So a single overwrite operation simply marks the end of one immutable object’s lifetime and the beginning of a new immutable object’s lifetime.
Data opacity. An object’s data component is completely opaque to Google Cloud Storage. It is just a chunk of data to Google Cloud Storage.
Hierarchy. Google Cloud Storage uses a flat structure to store buckets and objects. All buckets reside in a single flat hierarchy (you can’t put buckets inside buckets), and all objects reside in a single flat hierarchy within a given bucket.
Namespace. There is only one Google Cloud Storage namespace, which means:
- Every bucket must have a unique name across the entire Google Cloud
  Storage namespace.
- Object names must be unique only within a given bucket.
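
The flat namespace and the overwrite behavior described above can be modeled with a sorted map. The following sketch is an in-memory stand-in, not a Google Cloud Storage client; the class name FlatBucketSketch is hypothetical. Object names are plain keys, a "directory" listing is only a prefix query, and writing to an existing name replaces the whole object.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class FlatBucketSketch {

    // Every object lives in one flat, sorted key space; there are no nested containers.
    private final TreeMap<String, byte[]> objects = new TreeMap<>();

    /** Stores an object; writing to an existing name replaces the whole object. */
    public void put(String name, byte[] data) {
        objects.put(name, data.clone());
    }

    /** Returns a copy of the object data, or null when the name is unknown. */
    public byte[] get(String name) {
        byte[] data = objects.get(name);
        return data == null ? null : data.clone();
    }

    /** Lists the names that start with the prefix, which is how a hierarchy is simulated. */
    public List<String> listByPrefix(String prefix) {
        List<String> names = new ArrayList<>();
        for (String name : objects.tailMap(prefix, true).keySet()) {
            if (!name.startsWith(prefix)) {
                break; // keys are sorted, so no later key can match
            }
            names.add(name);
        }
        return names;
    }

    public static void main(String[] args) {
        FlatBucketSketch bucket = new FlatBucketSketch();
        bucket.put("/europe/france/paris.jpg", "v1".getBytes(StandardCharsets.UTF_8));
        bucket.put("/europe/france/cannes.jpg", "v1".getBytes(StandardCharsets.UTF_8));
        bucket.put("/europe/italy/rome.jpg", "v1".getBytes(StandardCharsets.UTF_8));
        System.out.println(bucket.listByPrefix("/europe/france/"));
    }
}
```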

Google Cloud Storage Characteristics

When you store your data on Google Cloud Storage, the service does all the background work to make data operations fast so you can focus on your application. The following are the main reasons:

GCS is built on Google’s proprietary network and datacenter technology. Google spent several years building proprietary infrastructure and technology to power Google’s sites (after all, fast is better than slow). When you use GCS, the same network goes to work for your data.
GCS replicates data to multiple data centers and serves end users’ requests from the nearest data center that holds a copy of the data. You have a choice of regions (currently U.S. and Europe) to allow you to keep your data close to where it is most needed. Data is also replicated to different disaster zones to ensure high availability.
GCS takes the replication one step further. When you upload an object and mark it as cacheable (by setting the standard HTTP Cache-Control header), GCS automatically figures out how best to serve it using Google’s broad network infrastructure, including caching it closer to the end-user if possible.
Last but not least, you don’t have to worry about optimizing your storage layout (like you would on a physical disk), or the lookups (i.e. directory and naming structure) like you would on most file systems and some other storage services. GCS takes care of all the “file system” optimizations behind the scenes.

Performance Considerations

When you select a service, one of the most important things to consider is its performance. The performance of a cloud storage service (or any cloud service for that matter) depends on two main factors:

The network that moves the data between the service and the end user.
The performance of the storage service itself.

Network

A key performance factor is the network path between the user’s location and the cloud service provider’s data centers. This path is critical because if the network is slow or unreliable, it doesn’t really matter how fast the service is. There are two main ways to make the network faster:

Serve the request from a center as close as possible to the user’s location.
Optimize the network routing between the user’s location and the data center.

Storage

The other performance factor is how quickly the data center processes a user’s request. This mainly implies the following:

Data must be managed optimally.
The request must be processed as fast as possible.

In a way, a cloud storage service is similar to a big distributed file system that performs the following tasks as efficiently as possible:

Checks authorization.
Locks the object (data) to access.
Reads the requested data from the physical storage medium.
Transfers data to the user.

For an example of an application using GCS, see Build GCP Cloud Storage Client.
