
BigQuery for First-Time Users

BigQuery is a big data analytics service hosted on the Google Cloud Platform. You can analyze large datasets by simply loading data into BigQuery and then executing SQL-like queries to gain analytic insights into your data. For more information, see Loading Data Into BigQuery and Query Reference.

You can load data directly from your workstation, but you will be limited in the size of the files you can upload. A preferred way is to use Google Cloud Storage as a staging area, where you can take full advantage of the service's capabilities such as unlimited file size, data replication, backup, and so on. For example, you can use Cloud Storage to store historical log data and then take snapshots of this data to load into BigQuery for analytical purposes. BigQuery is integrated seamlessly with Google Cloud Storage, as we'll see in the example section.
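
For example, once a file has been staged in Cloud Storage, it can be loaded with a single command. The following is a minimal sketch using the bq command-line tool (introduced later in this article) with hypothetical bucket, dataset, and column names; exact flags may vary by tool version.

    # Stage a local log snapshot in Cloud Storage (hypothetical bucket name)
    gsutil cp ./logs_2013_06.csv gs://my-staging-bucket/logs_2013_06.csv

    # Load the staged file into a BigQuery table, supplying the schema inline
    bq load --source_format=CSV mydataset.logs_2013_06 \
        gs://my-staging-bucket/logs_2013_06.csv \
        timestamp:timestamp,severity:string,message:string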

[Image: data flow between Google Cloud Storage and BigQuery]

Google Cloud Storage is typically used to store raw data before uploading it into BigQuery; this way, you always have access to the raw data if you want to reload it, combine it with other data, and so on. You can also export BigQuery data to Google Cloud Storage; for more information, see Exporting Data From BigQuery.
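
Going the other way, a table can be exported back to Cloud Storage. A minimal sketch with the bq command-line tool, again with hypothetical names:

    # Export a BigQuery table to a CSV file in Cloud Storage
    bq extract mydataset.logs_2013_06 gs://my-staging-bucket/exports/logs_2013_06.csv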

You can access BigQuery in one of the following ways:

  • BigQuery browser interface. A web interface that you can use to import and export data, run queries, and perform other user and management tasks. For more information, see BigQuery Browser.
  • Command-line tool (bq). A tool that you can use to perform BigQuery operations from the command line (see the sketch after this list). For more information, see Command-line Tool.
  • BigQuery API. A RESTful API that you can use to access BigQuery programmatically from your application. For more information, see API Reference.
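
As a taste of the command-line option, the following commands list datasets and tables and run an ad hoc query. They assume the bq tool is installed and authenticated; the dataset and table names are hypothetical.

    # List datasets in the default project
    bq ls

    # List tables in a dataset
    bq ls mydataset

    # Run an ad hoc query from the shell
    bq query "SELECT COUNT(*) FROM mydataset.mytable"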

Third-party tools are also available for loading, transforming and visualizing data. For more information, see Third-party Tools.  

BigQuery Terms

Google Cloud Platform uses a project-centric design: most functionality is tied to a project. Components of a project (called "services") can hold budgets and Access Control Lists (ACLs), for example. The following is a brief description of the terminology you should be familiar with while using BigQuery and Google Cloud Storage.

  • Projects. All data in BigQuery belongs to a project. A project consists of a set of users, a set of APIs, billing, Access Control Lists (ACLs) that determine access to datasets and jobs, and monitoring settings for those APIs. You can have one project or multiple projects. Projects are created and managed using the Google APIs Console. For information about the related API type, see Projects.
  • Datasets. A dataset is a grouping mechanism that holds zero or more tables. A dataset is the lowest-level unit of access control; you cannot control access at the table level. A dataset is contained within a specific project and can be shared with individual users. Datasets are also referenced in SQL statements when interacting with BigQuery. For information about the related API type, see Datasets.
  • Tables. Row-column structures that contain the actual data. They belong to a dataset. You query data at the table level, but access is controlled at the dataset level. For information about the related API type, see Tables.
  • Jobs. Jobs are the mechanism for scheduling tasks such as querying, loading data, or exporting data. A job is usually used to start potentially long-running tasks; shorter tasks, such as list or get requests, are not managed by a job resource. For information about the related API type, see Jobs.
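
This hierarchy is reflected in how resources are referenced: tables are addressed as dataset.table within a project, or project:dataset.table across projects. A small sketch with the bq tool, using hypothetical names:

    # Jobs are created implicitly by load, query, and extract commands
    bq show mydataset                       # a dataset in the default project
    bq show mydataset.mytable               # a table within that dataset
    bq show otherproject:mydataset.mytable  # the same table referenced from another project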

 

[Image: relationship between projects, datasets, and tables]

For more information, see Overview and Managing Jobs, Datasets and Projects.


Load and Query Data Using BigQuery Browser

To perform the steps described in this section, you must own or have access to a Google Cloud Platform project with the BigQuery API enabled. You also need credentials to access the API.

Enable the APIs you plan to use and then create the credentials they require. Depending on the API, you need an API key, a service account, or an OAuth 2.0 client ID. Refer to the API documentation for details. For more information, see Google Cloud API Common Tasks.

Prerequisites

Prepare and load the sample data.

  1. Obtain the data. The US government keeps a record of names given to newborn babies for each year going back to 1880. You can download the data from Popular Baby Names. The zip archive contains several comma-separated values (CSV) files (one for each year). The file names are in the format “yobNNNN.txt”. You can choose any file you like; for this example we’ll use the file yob1930.txt, which contains the most popular names for the year 1930.
  2. Unzip the archive. The format of each CSV file is name, gender, count, where gender is either “M” or “F” and count is the number of children given that name. Remember this format, because you will use it for the schema when you create your table.
Note: The sample data is already in comma-separated values (CSV) format, one of the formats accepted by BigQuery (the other being JSON), so no data preparation is required before loading it into BigQuery.
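
Each line of a file therefore follows the pattern below (placeholders, not actual values), and this is exactly the structure the table schema will mirror:

    <name>,<gender>,<count>     e.g. Mary,F,<number of girls named Mary that year>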

Create a Dataset

  1. Launch the BigQuery browser.
  2. In the left panel, select your project.
  3. Next to your project name, click the down arrow.
  4. In the popup menu, click Create new dataset.
  5. In the dialog box, for the Dataset ID enter the name of the dataset, for example: PopularNames.
  6. Click OK.
    The new dataset is created and listed under the project in the left panel.
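
If you prefer the command line, the same dataset can be created with the bq tool (a sketch; behavior may vary slightly by tool version):

    # Create the PopularNames dataset in the current project
    bq mk PopularNames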

Create and Populate a Table

Now, let’s create the table to hold the data we want to query.

  1. Click the arrow icon on the right side of the name of the dataset just created.
  2. In the popup menu, click Create new table.  A wizard dialog window is displayed.
  3. In the next dialog window, in the Table ID box, enter the name of the table. In the picture shown next, the name is Names_1930. Notice that the Dataset ID is PopularNames.
    [Image: Create table dialog]
  4. Click Next.
  5. In the next dialog window, keep the source format CSV selected.
  6. Under Load data from, click the Choose file button.
  7. From your local drive select the file yob1930.txt.
  8. Click Next.
  9. In the next dialog window, under Edit Schema, click the Edit as text link and enter the following schema: name:string,gender:string,count:integer.
  10. Click Next.
  11. In the next dialog window, ensure that the field delimiter selection is Comma.
  12. Click Submit and wait for the data to load. If the load is successful, you should see the table name Names_1930 displayed under the PopularNames dataset in the left panel.
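
The browser steps above can also be approximated from the command line. The sketch below creates and loads the table with a single bq load command, assuming yob1930.txt is in the current directory:

    # Load the local CSV into PopularNames.Names_1930, creating the table with the given schema
    bq load --source_format=CSV PopularNames.Names_1930 ./yob1930.txt \
        name:string,gender:string,count:integer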

Query the Data

Finally, here is where the rubber hits the road. Let’s query the table we have just populated.

    1. In the left panel, click the name of the table just created. You will see the schema information displayed in the main panel, as shown in the next picture.
      [Image: table details and schema]
    2. With the Names_1930 table selected, click the Query Table button in the upper right.
    3. In the New Query box enter the following query:
      SELECT name, count FROM PopularNames.Names_1930
      WHERE gender = 'F' ORDER BY count DESC LIMIT 5;
    4. Click the Run Query button. You should get the following results:
      [Image: query results]
      … So Mary was the most popular name for baby girls in the US in the year 1930.
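
The same query can also be run with the bq command-line tool; depending on the tool version, the default SQL dialect and quoting rules may differ slightly.

    # Top five female names in the 1930 data
    bq query "SELECT name, count FROM PopularNames.Names_1930 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"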