Utility program: add all images in your directory to make a video


If you want to combine all the images in your directory sequentially into a video for a presentation, then you can use this utility program.

In the first step, we make a series of “.png” images by plotting a bar diagram of the GDP of the top 10 countries for each year from 1960 to 2017. We then stitch these images together sequentially to make a video. Here, we combine the images using the cv2 (OpenCV) package of Python. Please keep in mind that the video can also be created using the matplotlib package of Python; cv2 is simply an alternative way to perform the same task, and it can be very handy whenever we have a collection of images we want to turn into a video.

To install the cv2 package using Anaconda, type:

conda install -c menpo opencv

Now, we can obtain the GDP data for all countries from 1960-2017 from the World Bank website. We then use the “pandas” package of Python to read the CSV data file and the “matplotlib” package to plot the bar diagrams.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# function for setting the format of tick values in trillions of dollars
def trillions(x, pos):
    'The two args are the value and tick position'
    return '${:.1f}T'.format(x * 1e-12)
formatter = FuncFormatter(trillions)

df = pd.read_csv("world_gdp_data.csv", skiprows=4)  # CSV downloaded from the World Bank (file name assumed)
df = df.set_index("Country Name")

# exclude the aggregated regions (e.g., "World") listed in this file
with open('neglect_regions.txt', mode='r') as file:
    alist = [line.rstrip() for line in file]
# print(alist)
df = df.drop(alist, errors='ignore')

num = 10  # number of countries to plot
years = range(1960, 2018)
for year in years:
    top_10 = df.nlargest(num, str(year))
    # print(top_10[str(year)])
    fig, ax = plt.subplots(figsize=(10,6))
    ax.bar(np.arange(num), top_10[str(year)].values)
    ax.yaxis.set_major_formatter(formatter)
    plt.xticks(np.arange(num), top_10.index.values, fontsize=6)
    plt.title("Top 10 countries by GDP in {}".format(year))
    plt.savefig("figures/{}.png".format(year), dpi=150)
    plt.close(fig) #to clear the memory




The series of bar diagrams for each year from 1960-2017 is saved in the “figures” directory. Then we can run the command:

python conv_image2video.py -loc "figures" -o "world_countries_gdp.mp4"

to add all the “.png” images in the figures directory to make the video named “world_countries_gdp.mp4”.

Usage of the program “conv_image2video.py”

  1. To convert the images in the current directory, with the extension “.png”, output file name “output.mp4” and frame rate 5 fps, simply type:

python conv_image2video.py

  2. If you want to change some options, add the following flags:
-loc "path_to_images": This will look for images in the directory "path_to_images"; default="."
-o "custom_name.mp4": This will output the video with the name "custom_name.mp4"; default: "output.mp4"
-ext "jpg": This will work on the jpeg images; default: "png"
-fr "1": This will change the speed/frame rate of the video; default: "5"
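The core of “conv_image2video.py” can be sketched as below. This is a minimal reconstruction, not the exact program: the flag names and defaults match the usage above, but the internal function names are my own.

```python
import argparse
import glob
import os

def parse_args(argv=None):
    # command-line flags matching the usage described above
    parser = argparse.ArgumentParser(description="Combine images into a video")
    parser.add_argument("-loc", default=".", help="directory containing the images")
    parser.add_argument("-o", default="output.mp4", help="output video file name")
    parser.add_argument("-ext", default="png", help="image extension to look for")
    parser.add_argument("-fr", default="5", help="frame rate of the video")
    return parser.parse_args(argv)

def images_to_video(loc, out, ext, fps):
    import cv2  # requires the opencv package (conda install -c menpo opencv)
    # collect the images in sorted (sequential) order
    images = sorted(glob.glob(os.path.join(loc, "*." + ext)))
    height, width, _ = cv2.imread(images[0]).shape
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    video = cv2.VideoWriter(out, fourcc, fps, (width, height))
    for image in images:
        video.write(cv2.imread(image))
    video.release()
```

In the script itself, one would call `images_to_video(args.loc, args.o, args.ext, float(args.fr))` after parsing the arguments.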


To download this program, click here.

If you want to download all the files (program +examples), click here.

The “ABC” of writing or editing scientific manuscript

If you’re doing science, you inevitably need to write manuscripts. I have read and collected, from various sources, some of the important points about writing in science. The most important and recommended source is the “Writing in the Sciences” course by Stanford University! I would like to share my collection with my readers. Don’t worry, I will keep it short and concise so that you don’t need to read it like a novel, but you can always refer to it while writing.

The aim of the writer should be to make the manuscript clear, elegant, and stylish. It should have a streamlined flow of ideas, for which the writer must follow some set of rules. After completing a sentence, writers must ask themselves whether it is readable, understandable, and enjoyable to read.

Editing the manuscript:

Sentence Level Editing:

Writing a manuscript always starts with editing at the lowest level, which is a sentence. For editing the manuscript at the sentence level, the writer should keep in mind a few things:

  1. Use active voice (Subject + Verb + Object). It is livelier and easier to read.
  2. Write with verbs (instead of nouns).
    • Use strong verbs.
    • Avoid turning verbs into nouns.
    • Don’t bury the main verb.
    • Pick the right verb.
    • Use “to be” verbs purposefully and sparingly.
    • Don’t turn spunky verbs into clunky nouns.
    • Use “compared to” to point out similarities between different things, and “compared with” to point out differences between similar things (the latter is more common in science).
    • Use “that” for defining clauses and “which” for non-defining clauses.
  3. Avoid using “his/her”. Instead, use “their”.
  4. Cut unnecessary words and phrases ruthlessly. Get rid of:
    • Dead-weight words and phrases, e.g., “as it is well known”, “as it has been shown”.
    • Empty words and phrases, e.g., “basic tenets of”, “methodologic”, “important”.
    • Long words or phrases.
    • Unnecessary jargon and acronyms.
    • Repetitive words or phrases.
    • Adverbs, e.g., very, really, quickly, basically, generally, etc.
  5. Eliminate negatives and superfluous uses of “there are/there is”.
  6. Omit needless prepositions. Change “They agreed that it was true” to “They agreed it was true.”
  7. Experiment with punctuation (comma, colon, dash, parentheses, semicolon, period). Use them to vary sentence structure.
    • A semicolon connects two independent clauses.
    • Use a colon to introduce a list, quote, explanation, conclusion, or amplification.
    • Parentheses to insert an afterthought/explanation.
    • Dash to add emphasis, or to insert an abrupt definition or description, join or condense. Don’t overuse it, or it loses its impact.
  8. Pairs of ideas joined by “and”, “or”, or “but” should be written in parallel form. E.g., The velocity decreased by 50% but the pressure decreased by only 10%.

Paragraph level Editing:

  1. 1 paragraph = 1 idea
  2. Give away the punch line early.
  3. The logical flow of ideas: general -> specific -> take-home message. Logical arguments: if a, then b; a, therefore b.
  4. Use parallel sentence structure.
  5. Use transition words where necessary.
  6. Put the emphasis at the end.
  7. Vary sentence length: long, short, long.
  8. Follow: arguments, counter-arguments, rebuttals.

Writing Process:

Many writers are not sure how to start and how to organize their work. Here are some tips.

  1. Prewriting: give 70% time
    • Get Organized first
      • Arrange key facts and citations from the literature into a crude road map; think in paragraphs and sections.
      • Like ideas should be grouped; like paragraphs should be grouped.
  2. Writing the first draft: give 10% time
    • Don’t be a perfectionist: get the ideas down in complete sentences, in order.
    • Focus on logical organization.
    • Write it quickly and efficiently.
  3. Revision: give 20% time
    • Read your work out loud: the brain processes the spoken word differently.
    • Do a verb check: underline the main verb in each sentence and watch out for lackluster, passive, or buried verbs.
    • Cut clutter: watch out for dead-weight words, empty words, long words and phrases, unnecessary jargon and acronyms, repetitive words or phrases, and adverbs.
    • Do an organizational review: tag each paragraph with a phrase or sentence that sums up the main point.
    • Get feedback from others: ask someone outside your department to read your manuscript. They should be able to grasp: the main findings, take-home messages, and significance of your work. Ask them to point out particularly hard-to-read sentences and paragraphs.
    • Get editing help: find a good editor to edit your work.


Checklist for Final Draft

  1. Check for consistency: the values of any variable (such as the mean temperature of your data) used in different sentences, paragraphs or sections should be the same.
  2. Check for numerical consistency:
    • Numbers in your abstract should match the numbers in your tables/figures/text,
    • Numbers in the text should match those in tables/figures.
  3. Check for references:
    • Does that information really exist in that paper?
    • Always cite/go back to primary source.

The Original Manuscript:

Recommended order of writing:

  1. Tables and Figures: Very important
    • Should stand alone and tell a complete story. The reader should not need to refer back to the text.
    • Use fewest figures/tables
    • Do not present the same data in both table and figure.
    • Figures: Visual impact, show trends and patterns, tell a quick story, tell a whole story, highlight particular result
      • Keep it simple (if it’s too complex, then maybe it belongs in a table).
      • Make it easy to distinguish the groups.
    • Tables: give precise values.
      • Use superscript symbols to identify footnotes and give footnotes to explain experimental details.
      • Use three horizontal lines for table format.
      • Make sure everything lines up and looks professional
      • Use a reasonable number of significant figures
      • Give units
      • Omit unnecessary columns
  2.  Results:
    • Summarize what the data show
    • Point out simple relationships
    • Describe big picture trends
    • Cite figures or tables that present supporting data.
    • Avoid repeating numbers that are already available in tables or figures.
    • Break into subsections with headings, if necessary.
    • Complement the information that is already in tables and figures
    • Give precise values that are not available in the figure
    • Report the percent change or percent difference if the absolute values are given in tables.
    • Don’t forget to talk about negative results.
    • Reserve information about what you did for the methods section
    • Reserve comments on the meaning of your results for the discussion section.
    • Use past tense for completed actions:
      • We found that…
      • Women were more likely to…
      • Men smoked more cigarettes than…
    • Use the present tense for assertions that continue to be true, such as what the tables show, what you believe, and what the data suggest:
      • Figure 1 shows…
      • The findings confirm….
      • The data suggest…
      • We believe that this shows…
    • Use the active voice.
  3. Methods:
    • Give a clear overview of what was done.
    • Give enough information to replicate the study.
    • Be complete but make life easy for your reader:
      • Break into smaller subsections with subheadings
      • Cite a reference for commonly used methods
      • Display a flow diagram or table where possible.
      • May use jargon and the passive voice more liberally
    • Use past tense to report methods (“we measured”)
    • Use present tense to describe how data are presented in the paper (“data are summarized as means ± SD”)
  4.  Introduction:
    • Typically 3 paragraphs long (recommended range 2-5)
    • Should focus on the specific hypothesis/aim of your study
    • Information comes in Cone format:
      • What’s known: Paragraph 1
      • What’s unknown: limitations and gaps in previous studies: paragraph 2
      • Your burning question/hypothesis/aim: paragraph 3
      • Experimental approach: paragraph 3
      • Why your experimental approach is new, different and important: paragraph 3
    • Keep paragraphs short
    • Write for the general audience (clear, concise, non-technical)
    • Do not answer the research questions.
    • Summarize at a high level
  5. Discussion:
    • Information comes in inverted cone format
    • Answer the questions asked
    • Support your conclusion
    • Defend your conclusion
    • Give a big-picture take-home message: what do my results mean and why should anyone care. Make sure your take-home message is clear and consistent.
    • Use active voice
    • Tell it like a story
    • Don’t travel too far from your data
    • Focus on limitations that matter
    • Verb Tense:
      • Past when referring to study details, results, analyses, and background research. E.g., we found that…, Subjects may have experienced…, Miller et al. found…
      • Present when talking about what the data suggest. E.g., the greater weight loss suggests…, the explanation for this difference is not clear, potential explanations include…
  6. Abstract:
    • Overview of the main story
    • Gives highlights from each section of the paper
    • Limited length (100-300 words)
    • Stands on its own
      • Background
      • Question/aim/hypothesis
      • Experiment
      • Results
      • Conclusion
      • Implication, speculation or recommendation


In the end, it is profoundly important to stay away from plagiarism. Do not pass off other people’s writing (or tables and figures) as your own.

What is plagiarism:

  1. Cutting and pasting sentences or even phrases.
  2. Slightly rewriting or re-arranging others’ words. It is unlikely that two people will independently come up with the same string of 7-8 words in a sentence.

-Utpal Kumar

Institute of Earth Sciences, Academia Sinica

How to make the rotating globe


Many people have asked me how to make the rotating globe we have on our home page. This post will take you step by step through making the rotating globe GIF using “gnuplot”.

Installing GNUPLOT

To install gnuplot, you can follow this page.

On a Mac, you can install gnuplot using Homebrew:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

This will install Homebrew on your Mac in case you don’t have it already.
Then use the command:

brew install gnuplot --with-x11

This should do the job, and then you should be able to run the “gnuplot” command from your terminal.

If this throws an error, then you can overwrite the symbolic links with:

brew link --overwrite gnuplot

Making the rotating globe gif

Now, you can download the code for making the GIF from GitHub by clicking here (Data Source: gnuplotting). Run this bash script on your computer; it may take a few seconds, and you should get the “globe.gif” file.



Working with MATLAB & Python simultaneously and smoothly​

MATLAB and Python are both well known for their usefulness in the scientific computing world, and each has its advantages. Python is preferable for most programmers because it is free, beautiful, and powerful. MATLAB, on the other hand, offers a solid collection of built-in functions as well as Simulink. Both languages have a large scientific community and are easy for beginners to learn, though MATLAB, because it includes all the packages we need, is easier to start with than Python, where we need to install extra packages and may also want an IDE. However, the proprietary nature of MATLAB’s algorithms, i.e., we cannot see the code of most of them and have to trust the implementation, sometimes makes it hard to prefer MATLAB over open alternatives like Python. Besides, these proprietary algorithms come at a cost!

In short, in order to excel in all of our scientific tasks, we need to learn to use both of them interchangeably. This is where Python’s SciPy module comes in handy. MATLAB reads its proprietary “mat” data format quite efficiently and fast. SciPy’s functions “loadmat” and “savemat” can easily read and write, respectively, data stored in a “mat” file from and to Python variables. Here, we show an example that takes data from MATLAB in the “mat” format and plots it in Python on a geographical map, which Python can do much more efficiently than MATLAB.
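As a quick illustration of the savemat/loadmat round trip (the variable and file names here are made up for the example):

```python
import numpy as np
import scipy.io as sio

# save a NumPy array into a "mat" file (names are examples only)
admF = np.array([0.5, 0.7, 0.9])
sio.savemat('example_output.mat', {'admF': admF})

# load it back; loadmat returns a dictionary keyed by the MATLAB variable names
data = sio.loadmat('example_output.mat')
print(data['admF'])  # note: vectors come back as 2-D (1 x N) arrays
```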

Saving the MATLAB variables into a “mat” file

In this example MATLAB script, we show how we can read “mat” files in a MATLAB script, operate on the data, and save the result.

clear; close all; clc 

%% Load the two matfiles with some data into the workspace
load python_export_CME_ATML_orig_vars;
load station_info;

%% Remove the unrequired variables from the MATLAB's memory 
clearvars slat slon;

%% Saving the data from the "station_info.mat" file as a cell data type
stns={slons' slats' stn_name'}; 

%% Conduct some operations with the data and save them in the "admF" variable
admF=[]; %initializing the matrix
for i=1:length(slons)
    ccU=corrcoef(dU(:,i),slU); %Making use of the available MATLAB functions
    admF=[admF ccU(1,2)*(std_dU/std_slU)];
end

%% Save the output "admF" matrix into the "MATLAB_export_admittance_output.mat" file at the desired location
save('MATLAB_export_admittance_output.mat','admF');

Using the data from the “mat” file in Python to plot on a geographical map

import scipy.io as sio #importing the scipy io module for reading the mat file
import numpy as np #importing numpy module for efficiently executing numerical operations
import matplotlib.pyplot as plt #importing the pyplot from the matplotlib library
from mpl_toolkits.basemap import Basemap #importing the basemap to plot the data onto geographical map
from matplotlib import rcParams
rcParams['figure.figsize'] = (10.0, 6.0) #predefine the size of the figure window
rcParams.update({'font.size': 14}) # setting the default fontsize for the figure
from matplotlib import style
style.use('ggplot') # I use the 'ggplot' style for plotting. This is optional and can be used only if desired.

# Change the color of the axis ticks
def setcolor(x, color):
     for m in x:
         for t in x[m][1]:
             t.set_color(color)

# Read the two mat files and save the MATLAB variables as Python variables
ADF = sio.loadmat('MATLAB_export_admittance_output.mat')
admF = ADF['admF'].ravel()

STN = sio.loadmat('station_info.mat')
slon, slat = STN['slons'].ravel(), STN['slats'].ravel()

## Converting the MATLAB cell type to a list of station-name strings
stnname = [str(ss[0]) for ss in STN['stn_name'].ravel()]

## Plotting the admittance values
offset = 0.5 # padding (in degrees) around the stations for the map extent (an assumption)
m = Basemap(llcrnrlon=min(slon)-offset,llcrnrlat=min(slat)-offset,urcrnrlon=max(slon)+offset,urcrnrlat=max(slat)+offset,
        resolution ='h',area_thresh=1000.)
xw,yw=m(slon,slat) #projecting the latitude and longitude data on the map projection

m.drawmapboundary(fill_color='#99ffff',zorder=0) #plot the map boundary
m.fillcontinents(color='w',zorder=1) #fill the continents region
m.drawcoastlines(linewidth=1,zorder=2) #draw the coastlines
# draw parallels
par=m.drawparallels(np.arange(21,26,1),labels=[1,0,0,0], linewidth=0.0)
setcolor(par,'k') #The color of the latitude tick marks has been set to black (default) but can be changed to any desired color
# draw meridians
m.drawmeridians(np.arange(120,123,1),labels=[0,0,0,1], linewidth=0.0)

cax=m.scatter(xw,yw,c=admF,zorder=3,s=300*admF,alpha=0.75,cmap='viridis') #plotting the data as a scatter points on the map
cbar = m.colorbar(cax) #plotting the colorbar
cbar.set_label(label='Estimated Admittance Factor',weight='bold',fontsize=16) #customizing the colorbar
plt.savefig('all_stations_admittance.png',dpi=200,bbox_inches='tight') #saving the best cropped output figure as a png file with resolution of 200 dpi.

Output figure from Python


Plotting the geospatial data clipped by coastlines in Python

In geosciences, we frequently have to make geospatial plots, but the available data is unevenly distributed and irregular (Figure 1). We usually want to show the data for the whole region, and one way of doing so is geospatial interpolation of the data. Geospatial interpolation merely means that we obtain interpolated values of the data at regular grid points, both longitudinally and latitudinally. After obtaining these values, if we plot the data, the grid points are most likely to extend beyond the coastline boundary of our study area. We wish to plot the data only inside the coastline borders of our study area. We can do that by removing all the grid points outside the perimeter. One way to clip the data outside the coastline path is to manually remove the grid points outside the region, but this method is quite tedious. We, programmers, love being lazy, and that helps us to seek better ways.

Figure 1: Scatter plot of the data. The size and color of the circles represent the data values.

In this post, we aim to do (1) the interpolation of these data values using the ordinary kriging method and (2) plot the output within the coastline border of Taiwan.

For implementing the ordinary kriging interpolation, we will use the “pykrige” kriging toolkit available for Python. The package can be easily installed using the “pip” or “conda” package manager for Python.

We start by importing all the necessary modules.

import numpy as np
import pandas as pd
import glob
from pykrige.ok import OrdinaryKriging
from pykrige.kriging_tools import write_asc_grid
import pykrige.kriging_tools as kt
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from matplotlib.colors import LinearSegmentedColormap
from matplotlib.path import Path
from matplotlib.patches import PathPatch
Figure 2: Data in the tabular format

The first step for interpolation is to read the available data. Our data is of the format shown in Figure 2. Let’s say we want to interpolate for the “R_FACTOR”. We first read this data file. We can easily do that using the “pandas” in Python.


Now, we have read the whole tabular data, but we need only the “R_FACTOR”, “St.Lat”, and “St.Lon” columns.
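These two steps can be sketched as follows. The column labels are taken from Figure 2; the data frame here is a tiny inline stand-in, since the real data file is not included in this post (in practice, one would call pd.read_csv on the data file instead).

```python
import io
import pandas as pd

# a tiny stand-in for the real data file (same column labels as in Figure 2)
csv_text = """Station,St.Lat,St.Lon,R_FACTOR
A,22.5,120.3,1500.0
B,23.1,121.0,2300.0
C,24.7,121.6,3100.0
"""
df = pd.read_csv(io.StringIO(csv_text))  # in the tutorial: pd.read_csv('your_data_file.csv')

# keep only the columns needed for the interpolation
lons = df['St.Lon'].values
lats = df['St.Lat'].values
data = df['R_FACTOR'].values
```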


Now, we have our required data available in the three variables. We can, now, define the grid points where we seek the interpolated values.

grid_space = 0.01  # desired grid spacing in degrees (an assumption; choose as needed)
grid_lon = np.arange(lons.min()-0.05, lons.max()+0.1, grid_space) 
grid_lat = np.arange(lats.min()-0.05, lats.max()+0.1, grid_space)

The minimum and maximum of the longitude and latitude are chosen based on the data.

We use the “ordinary kriging” function of “pykrige” package to interpolate our data at the defined grid points. For more details, the user can refer to the manual of the “pykrige” package.

OK = OrdinaryKriging(lons, lats, data, variogram_model='gaussian', verbose=True, enable_plotting=False,nlags=20)
z1, ss1 = OK.execute('grid', grid_lon, grid_lat)

“z1” contains the interpolated values of “R_FACTOR” at the grid_lon and grid_lat points.

Now, we wish to plot the interpolated values. We will use the “basemap” module to plot the geographic data.

xintrp, yintrp = np.meshgrid(grid_lon, grid_lat)
fig, ax = plt.subplots(figsize=(10,10))
m = Basemap(llcrnrlon=lons.min()-0.1,llcrnrlat=lats.min()-0.1,urcrnrlon=lons.max()+0.1,urcrnrlat=lats.max()+0.1, projection='merc', resolution='h',area_thresh=1000.,ax=ax)

We first make the 2D meshgrid using the grid points and then call the basemap object “m” with the Mercator projection. The map extent can also be defined manually instead of using the minimum and maximum of the latitude and longitude values as done here.

m.drawcoastlines() #draw coastlines on the map
x,y=m(xintrp, yintrp) # convert the coordinates into the map scales
ncols = 15 # number of contour levels (an assumption)
cs=ax.contourf(x, y, z1, np.linspace(0, 4500, ncols),extend='both',cmap='jet') #plot the data on the map.
cbar=m.colorbar(cs,location='right',pad="7%") #plot the colorbar on the map
# draw parallels.
parallels = np.arange(21.5,26.0,0.5)
m.drawparallels(parallels,labels=[1,0,0,0],fontsize=14, linewidth=0.0) #Draw the latitude labels on the map

# draw meridians
meridians = np.arange(119.5,122.5,0.5)
m.drawmeridians(meridians,labels=[0,0,0,1],fontsize=14, linewidth=0.0)

Figure 3: Interpolated values without masking the outer region.

This will give us the plot of the interpolated values (Figure 3). Here, we do not seek the plot outside the coastline boundary of Taiwan. We wish to mask the data outside the boundary.

##getting the limits of the map:
x0,x1 = ax.get_xlim()
y0,y1 = ax.get_ylim()
map_edges = np.array([[x0,y0],[x1,y0],[x1,y1],[x0,y1]])
##getting all polygons used to draw the coastlines of the map
polys = [p.boundary for p in m.landpolygons]

##combining with map edges
polys = [map_edges]+polys[:]
##creating a PathPatch
codes = [
    [Path.MOVETO] + [Path.LINETO for v in p[1:]]
    for p in polys
]

polys_lin = [v for p in polys for v in p]
codes_lin = [c for cs in codes for c in cs]

path = Path(polys_lin, codes_lin)
patch = PathPatch(path,facecolor='white', lw=0)

Here, the ‘facecolor’ of the ‘PathPatch’ defines the color of the masking. We kept it ‘white’, but the user can define any color they like.

##masking the data outside the inland of taiwan
ax.add_patch(patch)

Figure 4: Interpolated values with the masking of the outer region.


Time-series Analysis using Python I


Time-series analysis is essential in most fields of science, including geophysics, economics, etc. Most geophysical data comes in a time-series format, including seismic recordings. In this part of the tutorial series, we will see how we can quickly load the data and visualize it.


This tutorial does not require the reader to have any basic understanding of Python or any programming language. But we expect the reader to have the jupyter notebook installed on their system. If the reader has not installed it yet, then they can follow the previous post, where we went through the steps involved in getting started with Python.

What is a Time Series?

A time series is a collection of data points at fixed time intervals. It can be analyzed to obtain long-term trends, statistics, and many other sorts of inferences, depending on the subject.
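For instance, a minimal synthetic time series at fixed (yearly) intervals can be built and inspected with pandas (the trend and “seasonality” here are made up for illustration):

```python
import numpy as np
import pandas as pd

# a synthetic yearly series: a linear trend plus a periodic "seasonal" term
years = np.arange(1960, 1970)
values = 0.5 * (years - 1960) + np.sin(2 * np.pi * (years - 1960) / 5.0)
ts = pd.Series(values, index=years)

print(ts.mean())    # a simple long-term statistic
print(ts.idxmax())  # the year with the largest value
```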


We also need some data for the analysis. We demonstrate the analysis using our GPS data, which can be downloaded from here.

Let’s get started

The first step is always to start the Python interpreter. In our case, we will use the jupyter notebook.

Jupyter notebook can be started using the terminal. Firstly, navigate to the directory containing the data and then type “jupyter notebook” in your terminal.

jupyter notebook

Next, we create a new Python 3 notebook, rename it as pythontut1. Then, we need to import some of the libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.pyplot import rcParams
rcParams['figure.figsize'] = 15, 6

Loading the data

Now, we load the data using the pandas library functions. Here, we use the function read_csv. But, before that let’s observe the format of the data:

!head 0498.COR

The prefix “!” can be used to execute any Linux command in the notebook.

We can see that the data has no header information and 8 columns: the year, latitude, longitude, height, dN, dE, dU, and a navigation flag.

So, now we read the data and set the above names to the different columns.

df=pd.read_csv("0498.COR", header=None, delimiter='\s+', names=['Year',"Lat","Long","Hgt","dN","dE","dU","nav"])

It is essential to understand the above command. We gave the arguments of the file name, the header (default: the first line), the delimiter (default: a comma), and the names of each column, respectively. Then we can output the first 5 lines of the data using the df.head() command.

Our data is now loaded, and if we want to extract any section of the data, we can easily do that.


Here, we have used the column names to extract the two columns only. We can also use the index values.


When we use the “.loc” method to extract a section of the data, we use the column names, whereas when we use the “.iloc” method, we use the index values. Here, df.iloc[:,3] extracts all rows of the column with index 3 (“Hgt”).
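The extraction steps above can be sketched as follows, using a tiny stand-in data frame (the GPS file itself is not included here, so the values are made up; the column names match the tutorial):

```python
import pandas as pd

# a small stand-in for the GPS data frame, with the same column names
df = pd.DataFrame({
    'Year': [2010.0, 2010.1, 2010.2],
    'Lat': [23.5, 23.5, 23.5],
    'Long': [121.0, 121.0, 121.0],
    'Hgt': [55.2, 55.3, 55.1],
    'dN': [0.1, 0.2, 0.3],
    'dE': [0.0, 0.1, 0.1],
    'dU': [0.5, 0.4, 0.6],
    'nav': [1, 1, 1],
})

subset = df.loc[:, ['Year', 'dN']]   # .loc uses the column names
hgt = df.iloc[:, 3]                  # .iloc uses index values; column 3 is "Hgt"
```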


Now, we have the data loaded. Let’s plot the “dN”, “dE”, and “dU” values versus the year. Before doing that, let’s set the “Year” column as the index column.

df.set_index("Year", inplace=True)

We can see the “Year” column as the index of the data frame now. Plotting using Pandas is extremely easy.


We can also customize the plot easily.

plt.suptitle("GPS Data Visualization")

If we want to save the figure, then we can use the command:


This saves the figure as a PDF file named “0498Data.pdf”. The format can be set to any type: “.png”, “.jpg”, “.eps”, etc. We set the resolution to 300 dpi, which can be varied depending on our needs. Lastly, bbox_inches='tight' crops the figure to remove all the unnecessary white space.
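Putting the plotting and saving steps together (again with a made-up stand-in data frame; the Agg backend is used here only so the sketch runs without a display, and is not needed inside jupyter):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; inside jupyter, "%matplotlib inline" takes its place
import pandas as pd
import matplotlib.pyplot as plt

# stand-in for the GPS data frame with "Year" already set as the index
df = pd.DataFrame({'dN': [0.1, 0.2, 0.3],
                   'dE': [0.0, 0.1, 0.1],
                   'dU': [0.5, 0.4, 0.6]},
                  index=[2010.0, 2010.1, 2010.2])
df.index.name = 'Year'

df.plot(subplots=True)                 # one panel each for dN, dE, dU
plt.suptitle("GPS Data Visualization")
plt.savefig('0498Data.pdf', dpi=300, bbox_inches='tight')
plt.close('all')
```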

Next Tutorial

We have loaded the data and visualized it. But we can see that our data has some trend and seasonality. In the future tutorial, we will learn how to remove that.

Python: a language for science

Python has become the most popular language for programming, and its community is huge, active, and ever-increasing. Before Python, MATLAB was the preferred language of the scientific community, but now many researchers are migrating to Python. Like MATLAB, Python is quite easy to use, but beyond that, Python can be used for system control, software development, and easy integration with other languages like FORTRAN, C, etc. This makes it extremely popular among all sorts of communities. Some P.I.s want their students to work in Python because, according to them, even if the students switch fields or simply leave the field, it will still be useful to them.

If you are familiar with other programming languages, including MATLAB, then it’s very easy to switch to Python. However, there are some issues beginners and MATLAB users face while making the move to Python. Here, let us deal with those issues and get started with Python.

Getting started with Python

Currently, there are two major versions of Python: 2.7 and 3.6. We will stick with Python 3.6; the steps for version 2.7 are very similar.

Python can be downloaded directly from its official website https://www.python.org/downloads/.

But, here, we will download Python using the Miniconda installer from the Anaconda website. The steps are listed below:

  1. Download the installer for your Windows, Mac, or Linux system. For Windows, you will get a “.exe” file; double-click the file and follow the installation procedure. For Mac and Linux systems, the file will have the extension “.sh” (a shell script). Open your favorite terminal and navigate to the directory containing the downloaded Miniconda installation file. Then simply type “sh Miniconda*.sh” and follow the prompted steps. You will be asked to accept the license and enter your installation location. If you don’t have any preferred location, then the installer will install Miniconda in your home directory.

  2. Now Miniconda, as well as Python 3.6, has been installed on your machine. We need some more Python libraries to get going, which we can easily install using conda. Open your terminal and type “conda install jupyter notebook pandas numpy matplotlib”.

We now have all the necessary libraries to get started with Python. Let us open a terminal again and simply type “jupyter notebook” to launch the jupyter notebook. Jupyter notebook is a very popular interface which makes using Python a fun experience. It has all the features to make things simpler and useful. You can even run shell commands in the notebook. On top of that, you can write your notes using the Markdown language, which is a very straightforward language for writing. I strongly recommend learning Markdown. It won’t take more than 10 minutes and will be useful in so many other places.

Now that we have all the necessary tools, let’s use Python to make a very simple plot.

  1. First, start jupyter notebook by typing “jupyter notebook” in your terminal. It will open “Jupyter” in your favorite browser.
  2. Now, go to the “New” tab at the top right corner and then select “Python 3”. It will open a new notebook page for you.
  3. Let’s import some of the libraries we installed using conda, such as numpy and matplotlib. To execute a cell, use the “Ctrl/Cmd + Enter” key.
  4. We also need to execute the command “%matplotlib inline” to keep our plots inline in the notebook. This is one of the many features of the jupyter notebook.
  5. Let’s define our x and y values using the numpy library, which we have imported as “np” for short.
  6. Let’s plot our data.
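Steps 3-6 above might look like this in a notebook cell (the Agg backend here just lets the sketch run outside a notebook; inside jupyter, “%matplotlib inline” takes its place, and the x, y values are of course up to you):

```python
import matplotlib
matplotlib.use('Agg')  # not needed inside jupyter; "%matplotlib inline" is used there
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)  # 100 points between 0 and 2*pi
y = np.sin(x)                       # a simple function to plot

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.savefig("simple_plot.png")
```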

Writing NetCDF4 Data using Python

For how to read netCDF data, please refer to the previous post. The packages and tools required for writing netCDF data are the same as for reading it; check that page for the details.

Importing relevant libraries

import netCDF4 
import numpy as np


Let us create a new empty netCDF file named “new.nc” in the “../../data” directory and open it for writing.

ncfile = netCDF4.Dataset('../../data/new.nc',mode='w',format='NETCDF4_CLASSIC') 


Notice here that we have set the mode to “w”, which means write mode. We can also open the data in append mode (“a”). It is safer to first check whether the netCDF file is already open, and close it if so, using a try and except statement.

Creating Dimensions

We can now fill the netCDF files opened with the dimensions, variables, and attributes. First of all, let’s create the dimension.

lat_dim = ncfile.createDimension('lat', 73) # latitude axis
lon_dim = ncfile.createDimension('lon', 144) # longitude axis
time_dim = ncfile.createDimension('time', None) # unlimited axis (can be appended to).
for dim in ncfile.dimensions.items():
    print(dim)


Every dimension has a name and a length. If we set the dimension length to 0 or None, it is treated as unlimited and can grow. Since we are following the netCDF classic format, only one dimension can be unlimited; to make more than one dimension unlimited, use the NETCDF4 format. Here, we will stick to the classic format, as it is the simplest one.

Creating attributes

One of the nice features of netCDF data format is that we can also store the meta-data information along with the data. This information can be stored as attributes.

ncfile.title='My model data'


ncfile.subtitle="My model data subtitle"
ncfile.anything="write anything"


We can add as many attributes as we like.

Creating Variables

Now, let us add some variables to store some data in them. A variable has a name, a type, a shape and some data values. The shape of the variable can be stated using the tuple of the dimension names. The variable should also contain some attributes such as units to describe the data.

lat = ncfile.createVariable('lat', np.float32, ('lat',))
lat.units = 'degrees_north'
lat.long_name = 'latitude'
lon = ncfile.createVariable('lon', np.float32, ('lon',))
lon.units = 'degrees_east'
lon.long_name = 'longitude'
time = ncfile.createVariable('time', np.float64, ('time',))
time.units = 'hours since 1800-01-01'
time.long_name = 'time'
temp = ncfile.createVariable('temp',np.float64,('time','lat','lon')) # note: unlimited dimension is leftmost
temp.units = 'K' # degrees Kelvin
temp.standard_name = 'air_temperature' # this is a CF standard name


Here, we create the variables using the createVariable method. This method takes three arguments: a variable name (a string), a data type, and a tuple containing the dimension names. We have also added some attributes; for example, for the variable lat we added the units and long_name attributes. Also, notice the units of the time variable.

In addition, we defined the 3-dimensional variable “temp”, which depends on the dimensions time, lat and lon.

In addition to the custom attributes, the netCDF provides some pre-defined attributes as well.

print("-- Some pre-defined attributes for variable temp:")
print("temp.dimensions:", temp.dimensions)
print("temp.shape:", temp.shape)
print("temp.dtype:", temp.dtype)
print("temp.ndim:", temp.ndim) 


Since no data has been added, the length of the time dimension is 0.

Writing Data

nlats = len(lat_dim); nlons = len(lon_dim); ntimes = 3
lat[:] = -90. + (180./(nlats-1))*np.arange(nlats) # south pole to north pole
lon[:] = (360./nlons)*np.arange(nlons) # Greenwich meridian eastward
data_arr = np.random.uniform(low=280,high=330,size=(ntimes,nlats,nlons))
temp[:,:,:] = data_arr # Appends data along unlimited dimension
print("-- Wrote data, temp.shape is now ", temp.shape)
print("-- Min/Max values:", temp[:,:,:].min(), temp[:,:,:].max())


The lengths of the lat and lon variables equal their dimension sizes. Since the time dimension is unlimited and can grow, we can write any number of time slices. We can treat a netCDF array as a numpy array and add data to it. The above statement writes all the data at once, but we can do it iteratively as well.

Now, let’s add another time slice.

data_slice = np.random.uniform(low=280,high=330,size=(nlats,nlons))
temp[3,:,:] = data_slice 
print("-- Wrote more data, temp.shape is now ", temp.shape) 


Note that we haven’t added any data to the time variable yet.

times_arr = time[:]


The dashes indicate that there is no data available. Also, notice the four dashes corresponding to the four time slices.

Now, let us write some data to the time variable using the datetime module of Python and the date2num function of netCDF4.

import datetime as dt
from netCDF4 import date2num,num2date
dates = [dt.datetime(2014,10,1,0),dt.datetime(2014,10,2,0),dt.datetime(2014,10,3,0),dt.datetime(2014,10,4,0)]


times = date2num(dates, time.units)
print(times, time.units) # numeric values
time[:] = times # write the numeric time values into the time variable


Now, it’s important to close the netCDF file which has been opened previously. This flushes buffers to make sure all the data gets written. It also releases the memory resources used by the netCDF file.

# first print the Dataset object to see what we've got
# close the Dataset.
ncfile.close(); print('Dataset is closed!')



Reading NetCDF4 Data in Python

In Earth Sciences, we often deal with multidimensional data structures such as climate data and GPS data. It’s hard to save such data in text files: they take a lot of memory, and reading, writing, and processing them is slow. One of the best tools to deal with such data is netCDF4. It stores the data in the HDF5 format (Hierarchical Data Format), which is designed to store large amounts of data. NetCDF is a project hosted by the Unidata Program at the University Corporation for Atmospheric Research (UCAR).

Here, we learn how to read and write netCDF4 data. We follow the workshop by Unidata. You can check out the website of Unidata.



You can install Python 3 via the Anaconda platform. I would recommend Miniconda over Anaconda because it is lighter and installs only the fundamental requirements for Python.

NetCDF4 Package:

conda install -c conda-forge netcdf4

Reading NetCDF data:

Now, we are good to go. Let’s see how we can read netCDF data. NetCDF data files have the extension “.nc”.


Importing netCDF4 and numpy (a Python library that supports large multi-dimensional arrays or matrices):

import netCDF4
import numpy as np

Now, let us create a NetCDF Dataset object:

f = netCDF4.Dataset('../../data/rtofs_glo_3dz_f006_6hrly_reg3.nc')


Here, we have read a netCDF file “rtofs_glo_3dz_f006_6hrly_reg3.nc”. When we print the object “f”, we notice that it has a file format of HDF5. It also has other information regarding the title, institution, etc. for the data. These are known as metadata.

At the end of the printed output, we see the dimensions and variable information of the dataset. This dataset has 4 dimensions: MT (size: 1), Y (size: 850), X (size: 712), and Depth (size: 10). Then we have the variables. The variables are based on the defined dimensions, and each is listed with its data type, such as float64 MT (dimension: MT).

Some variables are based on only one dimension while others are based on more than one. For example, the “temperature” variable relies on four dimensions – MT, Depth, Y, X, in that order.

We can access the information from this object, “f” just like we read a dictionary in Python.

print(f.variables.keys()) # get all variable names


This outputs the names of all the variables in the read netCDF file referenced by “f” object.

We can also individually access each variable:

temp = f.variables['temperature'] # temperature variable


The “temperature” variable is of type float32 and has 4 dimensions – MT, Depth, Y, X. We can also get the other information (metadata) such as the coordinates, standard name, and units of the variable. Coordinate variables are the 1D variables that have the same name as dimensions; they are helpful in locating the values in time and space. The unit of the temperature data is “degC”. The current shape gives the information about the shape of this variable; here, it is (1, 10, 850, 712).

We can also check the dimension size of this variable individually:

for d in f.dimensions.items():
    print(d)


The first dimension, “MT”, has a size of 1, but it is of unlimited type. This means that the size of this dimension can be increased indefinitely. The sizes of the other dimensions are fixed.

For just finding the dimensions supporting the “temperature” variable, we can print its pre-defined attributes:

print(temp.dimensions)
print(temp.shape)

Similarly, we can also inspect the variables associated with each dimension:

mt = f.variables['MT']
depth = f.variables['Depth']
x,y = f.variables['X'], f.variables['Y']


Here, we obtain the information about each of the four dimensions. The “MT” dimension, which is also a variable, has a long name of “time” and units of “days since 1900-12-31 00:00:00”. The four dimensions denote the four axes, namely MT: T, Depth: Z, X: X, Y: Y.

Now, how do we access the data from the netCDF variable we have just read? NetCDF variables behave similarly to numpy arrays; they can also be sliced and masked.

Let us first read the data of the variable “MT”:

time = mt[:] 


Similarly, for the depth array:

dpth = depth[:]


We can also apply conditionals on the slicing of the netCDF variable:

xx,yy = x[:],y[:]
print('shape of temp variable: %s' % repr(temp.shape))
tempslice = temp[0, dpth > 400, yy > yy.max()/2, xx > xx.max()/2]
print('shape of temp slice: %s' % repr(tempslice.shape))


Now, let us address one question based on the given dataset: “What is the sea surface temperature and salinity at 50N, 140W?”

Our dataset has the variables temperature and salinity. The “temperature” variable represents the sea temperature (see the long name). Now, we have to access the sea-surface temperature and salinity at the given geographical coordinates.

The X and Y variables do not give the geographical coordinates, but the dataset also provides the variables Latitude and Longitude.

lat, lon = f.variables['Latitude'], f.variables['Longitude']


Great! So we can access the latitude and longitude data. Now, we need to find the array indices, say iy and ix, such that Latitude[iy, ix] is close to 50 and Longitude[iy, ix] is close to -140. We can find these indices by defining a function:

# extract lat/lon values (in degrees) to numpy arrays
latvals = lat[:]; lonvals = lon[:] 

# a function to find the index of the grid point closest
# (in squared distance) to the given lat/lon value.
def getclosest_ij(lats,lons,latpt,lonpt):
    # find squared distance of every point on grid
    dist_sq = (lats-latpt)**2 + (lons-lonpt)**2
    # 1D index of minimum dist_sq element
    minindex_flattened = dist_sq.argmin()
    # get 2D index for latvals and lonvals arrays from 1D index
    return np.unravel_index(minindex_flattened, lats.shape)

iy_min, ix_min = getclosest_ij(latvals, lonvals, 50., -140)

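The same function can be sanity-checked on a small synthetic grid (pure numpy; the 5x5 grid below is made up for illustration):

```python
import numpy as np

def getclosest_ij(lats, lons, latpt, lonpt):
    # squared distance of every grid point to the target point
    dist_sq = (lats - latpt)**2 + (lons - lonpt)**2
    # convert the 1D index of the minimum back to a 2D (iy, ix) index
    return np.unravel_index(dist_sq.argmin(), lats.shape)

# a toy 5x5 grid spanning 40-60N and 150-130W
lats2d, lons2d = np.meshgrid(np.linspace(40, 60, 5),
                             np.linspace(-150, -130, 5), indexing='ij')
iy, ix = getclosest_ij(lats2d, lons2d, 50., -140.)
print(lats2d[iy, ix], lons2d[iy, ix])  # the grid point nearest to 50N, 140W
```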

So, now we have all the information required to answer the question.

sal = f.variables['salinity']
# Read values out of the netCDF file for temperature and salinity
print('%7.4f %s' % (temp[0,0,iy_min,ix_min], temp.units))
print('%7.4f %s' % (sal[0,0,iy_min,ix_min], sal.units))


Accessing the Remote Data via openDAP:

We can access remote data seamlessly using the netcdf4-python API, via the OPeNDAP protocol and DAP servers such as the THREDDS Data Server (TDS).

For using this functionality, we require the additional package “siphon”:

conda install -c unidata siphon 

Now, let us access one catalog data:

from siphon.catalog import get_latest_access_url
URL = get_latest_access_url('http://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/Global_0p5deg/catalog.xml',
                            'OPENDAP')
gfs = netCDF4.Dataset(URL)


# Look at metadata for a specific variable
# gfs.variables.keys() # will show all available variables.
sfctmp = gfs.variables['Temperature_surface']
print(sfctmp) # get info about sfctmp


# print coord vars associated with this variable
for dname in sfctmp.dimensions:
    print(gfs.variables[dname])


Dealing with the Missing Data

soilmvar = gfs.variables['Volumetric_Soil_Moisture_Content_depth_below_surface_layer']


# flip the data in latitude so North Hemisphere is up on the plot
soilm = soilmvar[0,0,::-1,:] 
print('shape=%s, type=%s, missing_value=%s' % \
 (soilm.shape, type(soilm), soilmvar.missing_value))


import matplotlib.pyplot as plt
%matplotlib inline
cs = plt.contourf(soilm)


Here, the soil moisture has been illustrated on the land only. The white areas on the plot are the masked values.
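The masking itself is ordinary numpy masked-array behaviour; a minimal sketch of what the netCDF4 library does under the hood with a missing_value (the numbers here are made up):

```python
import numpy as np

raw = np.array([0.25, 0.30, -9999.0, 0.28])  # -9999.0 stands in for a missing_value
soil = np.ma.masked_values(raw, -9999.0)     # mask every element equal to that value
print(soil.count())  # number of valid (unmasked) points
print(soil.mean())   # statistics ignore the masked element
```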

Dealing with Dates and Times

The time variables are usually measured relative to a fixed date using a certain calendar, with units such as “hours since YYYY-MM-DD hh:mm:ss”.
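The conversion that date2num performs is simple arithmetic against the reference date; for example, assuming units of “hours since 1800-01-01”:

```python
import datetime as dt

epoch = dt.datetime(1800, 1, 1)                  # the reference date from the units string
date = dt.datetime(2014, 10, 1)
hours = (date - epoch).total_seconds() / 3600.0  # the numeric value stored in the file
print(hours)
```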

from netCDF4 import num2date, date2num, date2index
timedim = sfctmp.dimensions[0] # time dim name
print('name of time dimension = %s' % timedim)


Time is usually the first dimension.

times = gfs.variables[timedim] # time coord var
print('units = %s, values = %s' % (times.units, times[:]))


dates = num2date(times[:], times.units)
print([date.strftime('%Y-%m-%d %H:%M:%S') for date in dates[:10]]) # print only first ten...


We can also get the index associated with a specified date and retrieve the forecast data for that date.

import datetime as dt
date = dt.datetime.now() + dt.timedelta(days=3)
ntime = date2index(date,times,select='nearest')
print('index = %s, date = %s' % (ntime, dates[ntime]))


This gives the time index for the time nearest to 3 days from now (the current time).

Now, we can again make use of the previously defined “getclosest_ij” function to find the indices of the latitude and longitude.

lats, lons = gfs.variables['lat'][:], gfs.variables['lon'][:]
# lats, lons are 1-d. Make them 2-d using numpy.meshgrid.
lons, lats = np.meshgrid(lons,lats)
j, i = getclosest_ij(lats,lons,40,-105)
fcst_temp = sfctmp[ntime,j,i]
print('Boulder forecast valid at %s UTC = %5.1f %s' % \
      (dates[ntime], fcst_temp, sfctmp.units))


So, we have the forecast for 2017-10-06 15:00 UTC: the surface temperature at Boulder is 304.2 K.

Simple Multi-file Aggregation

If we have many similar datasets, then we can aggregate them as one. For example, if we have many netCDF files, each representing data for a different year, then we can aggregate them as one dataset.


Multi-File Dataset (MFDataset) uses file globbing to patch together all the files into one big Dataset. Limitations:

  • It can only aggregate the data along the leftmost dimension of each variable.
  • It only works with NETCDF3 or NETCDF4_CLASSIC formatted files.
  • It is kind of slow.

mf = netCDF4.MFDataset('../../data/prmsl*nc')
times = mf.variables['time']
dates = num2date(times[:],times.units)
print('starting date = %s' % dates[0])
print('ending date = %s'% dates[-1])
prmsl = mf.variables['prmsl']
print('times shape = %s' % times.shape)
print('prmsl dimensions = %s, prmsl shape = %s' %\
 (prmsl.dimensions, prmsl.shape))


Finally, we need to close the opened netCDF dataset:

mf.close()

To download the data, click here. Next, we will see how to write a netCDF data.