Reading NetCDF4 Data in Python

In Earth Sciences, we often deal with multidimensional data structures such as climate data, GPS data. It ‘s hard to save such data in text files as it would take a lot of memory as well as it is not fast to read, write and process it. One of the best tools to deal with such data is netCDF4. It stores the data in the HDF5 format (Hierarchical Data Format). The HDF5 is designed to store a large amount of data. NetCDF is the project hosted by Unidata Program at the University Corporation for Atmospheric Research (UCAR).

Here, we learn how to read and write netCDF4 data. We follow the workshop by Unidata. You can check out the website of Unidata.



You can install Python3 via the Anaconda platform. I would recommend Miniconda over Anaconda because it is more light and installs only fundamental requirements for Python.

NetCDF4 Package:

conda install -c conda-forge netcdf4

Reading NetCDF data:

Now, we are good to go. Let’s see how we can read a netCDF data. The netCDF data has the extension of “.nc”


Importing NetCDF and Numpy ( a Python library that supports large multi-dimensional arrays or matrices):

import netCDF4
import numpy as np

Now, let us open a NetCDF Dataset object:

f = netCDF4.Dataset('../../data/')

Screen Shot 2017-10-03 at 12.21.35 PM.png

Here, we have read a NetCDF file “”. When we print the object “f”, then we can notice that it has a file format of HDF5. It also has other information regarding the title, institution, etc for the data. These are known as metadata.

In the end of the object file print output, we see the dimensions and variable information of the data set. This dataset has 4 dimensions: MT (with size 1), Y (size: 850), X (size: 712), Depth (size: 10). Then we have the variables. The variables are based on the defined dimensions. The variables are outputted with their data type such as float64 MT (dimension: MT).

Some variables are based on only one dimension while others are based on more than one. For example, “temperature” variable relies on four dimensions – MT, Depth, Y, X in the same order.

We can access the information from this object, “f” just like we read a dictionary in Python.

print(f.variables.keys()) # get all variable names

Screen Shot 2017-10-03 at 12.35.04 PM.png

This outputs the names of all the variables in the read netCDF file referenced by “f” object.

We can also individually access each variable:

temp = f.variables['temperature'] # temperature variable

Screen Shot 2017-10-03 at 12.35.47 PM.png

The “temperature” variable is of the type float32 and has 4 dimensions – MT, Depth, Y, X. We can also get the other information (meta-data) like the coordinates, standard name, units of the variable. Coordinate variables are the 1D variables that have the same name as dimensions. It is helpful in locating the values in time and space. The unit of temperature variable data is “degC”. The current shape gives the information about the shape of this variable. Here, it has the shape of (1, 10, 850, 712) for each dimension.

We can also check the dimension size of this variable individually:

for d in f.dimensions.items():

Screen Shot 2017-10-03 at 12.44.11 PM.png

The first dimension “MT” has the size of 1, but it is of unlimited type. This means that the size of this dimension can be increased indefinitely. The size of the other dimensions is fixed.

For just finding the dimensions supporting the “temperature” variable:


Screen Shot 2017-10-03 at 12.51.38 PM.png


Screen Shot 2017-10-03 at 12.54.34 PM.png

Similarly, we can also inspect the variables associated with each dimension:

mt = f.variables['MT']
depth = f.variables['Depth']
x,y = f.variables['X'], f.variables['Y']

Screen Shot 2017-10-03 at 12.58.09 PM.png

Here, we obtain the information about each of the four dimensions. The “MT” dimension, which is also a variable has a long name of “time” and units of “days since 1900-12-31 00:00:00”.  The four dimensions denote the four axes, namely- MT: T, Depth: Z, X:X, Y: Y.

Now, how do we access the data from the NetCDF variable we have just read. The NetCDF variables behave similarly to NumPy arrays. NetCDF variables can also be sliced and masked.

Let us first read the data of the variable “MT”:

time = mt[:] 

Screen Shot 2017-10-03 at 1.07.22 PM.png

Similarly, for the depth array:

dpth = depth[:]

Screen Shot 2017-10-03 at 1.08.32 PM.png

We can also apply conditionals on the slicing of the netCDF variable:

xx,yy = x[:],y[:]
print('shape of temp variable: %s' % repr(temp.shape))
tempslice = temp[0, dpth > 400, yy > yy.max()/2, xx > xx.max()/2]
print('shape of temp slice: %s' % repr(tempslice.shape))

Screen Shot 2017-10-03 at 1.10.57 PM.png

Now, let us address one question based on the given dataset. “What is the sea surface temperature and salinity at 50N and 140W?

Our dataset has the variables temperature and salinity. The “temperature” variable represents the sea surface temperature (see the long name). Now, we have to access the sea-surface temperature and salinity at a given geographical coordinates. We have the variables latitude and longitude as well.

The X and Y variables do not give the geographical coordinates. But we have the variables latitude and longitude as well.

lat, lon = f.variables['Latitude'], f.variables['Longitude']

Screen Shot 2017-10-03 at 1.19.13 PM.png

Great! So we can access the latitude and longitude data. Now, we need to find the array index, say iy and ix such that Latitude[iy, ix] is close to 50 and Longitude[iy, ix] is close to -140. We can find out the index by defining a function:

# extract lat/lon values (in degrees) to numpy arrays
latvals = lat[:]; lonvals = lon[:] 

# a function to find the index of the point closest pt
# (in squared distance) to give lat/lon value.
def getclosest_ij(lats,lons,latpt,lonpt):
 # find squared distance of every point on grid
 dist_sq = (lats-latpt)**2 + (lons-lonpt)**2 
 # 1D index of minimum dist_sq element
 minindex_flattened = dist_sq.argmin()
 # Get 2D index for latvals and lonvals arrays from 1D index
 return np.unravel_index(minindex_flattened, lats.shape)

iy_min, ix_min = getclosest_ij(latvals, lonvals, 50., -140)

Screen Shot 2017-10-03 at 1.24.01 PM.png

So, now we have all the information required to answer the question.

sal = f.variables['salinity']
# Read values out of the netCDF file for temperature and salinity
print('%7.4f %s' % (temp[0,0,iy_min,ix_min], temp.units))
print('%7.4f %s' % (sal[0,0,iy_min,ix_min], sal.units))

Screen Shot 2017-10-03 at 1.27.04 PM.png

Accessing the Remote Data via openDAP:

We can access the remote data seamlessly using the netcdf4-python API. We can access via the DAP protocol and DAP servers, such as TDS.

For using this functionality, we require the additional package “siphon”:

conda install -c unidata siphon 

Now, let us access one catalog data:

from siphon.catalog import get_latest_access_url
URL = get_latest_access_url('',
gfs = netCDF4.Dataset(URL)

Screen Shot 2017-10-03 at 1.36.59 PM.png

# Look at metadata for a specific variable
# gfs.variables.keys() #will show all available variables.
sfctmp = gfs.variables['Temperature_surface']
# get info about sfctmp

Screen Shot 2017-10-03 at 1.38.19 PM.png

# print coord vars associated with this variable
for dname in sfctmp.dimensions: 

Screen Shot 2017-10-03 at 1.39.42 PM.png

Dealing with the Missing Data

soilmvar = gfs.variables['Volumetric_Soil_Moisture_Content_depth_below_surface_layer']

Screen Shot 2017-10-03 at 1.42.51 PM.png

# flip the data in latitude so North Hemisphere is up on the plot
soilm = soilmvar[0,0,::-1,:] 
print('shape=%s, type=%s, missing_value=%s' % \
 (soilm.shape, type(soilm), soilmvar.missing_value))

Screen Shot 2017-10-03 at 1.44.02 PM.png

import matplotlib.pyplot as plt
%matplotlib inline
cs = plt.contourf(soilm)

Screen Shot 2017-10-03 at 1.45.33 PM.png

Here, the soil moisture has been illustrated on the land only. The white areas on the plot are the masked values.

Dealing with Dates and Times

The time variables are usually measured relative to a fixed date using a certain calendar. The specified units are like “hours since YY:MM:DD hh:mm:ss”.

from netCDF4 import num2date, date2num, date2index
timedim = sfctmp.dimensions[0] # time dim name
print('name of time dimension = %s' % timedim)

Screen Shot 2017-10-03 at 1.51.34 PM.png

Time is usually the first dimension.

times = gfs.variables[timedim] # time coord var
print('units = %s, values = %s' % (times.units, times[:]))

Screen Shot 2017-10-03 at 1.54.25 PM.png

dates = num2date(times[:], times.units)
print([date.strftime('%Y-%m-%d %H:%M:%S') for date in dates[:10]]) # print only first ten...

Screen Shot 2017-10-03 at 1.55.46 PM.png

We can also get the index associated with the specified date and forecast the data for that date.

import datetime as dt
date = + dt.timedelta(days=3)
ntime = date2index(date,times,select='nearest')
print('index = %s, date = %s' % (ntime, dates[ntime]))

Screen Shot 2017-10-03 at 1.57.50 PM.png

This gives the time index for a time nearest to 3 days from today, current time.

Now, we can again make use of the previously defined “getcloset_ij” function to find the index of the latitude and longitude.

lats, lons = gfs.variables['lat'][:], gfs.variables['lon'][:]
# lats, lons are 1-d. Make them 2-d using numpy.meshgrid.
lons, lats = np.meshgrid(lons,lats)
j, i = getclosest_ij(lats,lons,40,-105)
fcst_temp = sfctmp[ntime,j,i]
print('Boulder forecast valid at %s UTC = %5.1f %s' % \

Screen Shot 2017-10-03 at 2.01.18 PM.png

So, we have the forecast for 2017-10-06 15 hrs. The surface temperature at boulder is 304.2 K.

Simple Multi-file Aggregation

If we have many similar data, then we can aggregate them as one. For example, if we have the many netCDF files representing data for different years, then we can aggregate them as one.

Screen Shot 2017-10-03 at 2.08.20 PM.png

Multi-File Dataset (MFDataset) uses file globbing to patch together all the files into one big Dataset.
Limitations:- It can only aggregate the data along the leftmost dimension of each variable.

  • It can only aggregate the data along the leftmost dimension of each variable.
  • only works with NETCDF3, or NETCDF4_CLASSIC formatted files.
  • kind of slow.
mf = netCDF4.MFDataset('../../data/prmsl*nc')
times = mf.variables['time']
dates = num2date(times[:],times.units)
print('starting date = %s' % dates[0])
print('ending date = %s'% dates[-1])
prmsl = mf.variables['prmsl']
print('times shape = %s' % times.shape)
print('prmsl dimensions = %s, prmsl shape = %s' %\
 (prmsl.dimensions, prmsl.shape))

Screen Shot 2017-10-03 at 2.10.53 PM.png

Finally, we need to close the opened netCDF dataset.


Screen Shot 2017-10-03 at 2.12.18 PM.png

To download the data, click here. Next, we will see how to write a netCDF data.

Generic Mapping Tool (GMT) for Beginners

To download all the scripts, figures and data, click here.

Basemap for Linear Plots:


Bash script to generate above plots is here.

Plotting Maps:

Bash script to generate above plots is here and here.

Writing Text on the Figure:


Bash script to generate above plots is here.

Contour Plots:

Bash script to generate above plots is here.

Colored Images and Color-bars:

Bash script to generate above plots is here.

3-D Plots:


Bash script to generate above plots is here.


Earthquake Finder

This program can be used to obtain the earthquake information from various sources. The user can search for any event for given time range, magnitude range, depth range, geographical area. It can also obtain the focal mechanism solutions for the given parameters. It gives the output file (catalog.txt by default) where all the information is stored. It also plots the output (both with the focal mechanism and without it.)
This program makes use of the obspy module of Python and retrieve data from the International Seismological Center (ISC) server


to run with default parameters


to obtain the earthquake info between today, March 2016 to the current time

python3 st=2016/3

to obtain the earthquake info between 2016, march, 29 to 2016, september, 22

python3 st=2016/3/29,et=2016/9/22

to obtain the catalog for magnitudes between 4-9 and with focal mechanism

python3 mxmag=9,mnmag=4,fm=yes

to obtain catalog for magnitude 4-7, and within radius of 10 (default) and central coordinates 22(lat),121(lon) with focal mechanism

python3 mxmag=7,mnmag=4,clat=22,clon=121,mxrad=80,fm=yes


Parameters to change (default values in the braces):

mnla(-90),mxla(90),mnlo(-180),mxlo(180),mndep(0),mxdep(700),mnmag(4),mxmag(10),mnrad(0),mxrad(10),clat(None),clon(None),st(-1 month),et (current time),fm(no)

Output format of the catalog


With Focal Mechanism -> “YEAR MONTH DAY HOUR MIN SEC LONGITUDE LATITUDE DEPTH EXP(Nm) M0 MAG Mrr Mtt Mpp Mrt Mtp Mpr Str1 Dip1 Rake1 Str2 Dip2 Rake2.”


  1. Python 3: Can be obtained from here.
  2. Extra Modules
    (a) requests: pip install requests
    (b) pandas: pip install pandas
    (c) obspy: pip install obspy
    (d) basemap: It is a great tool for creating maps using python in a simple way. The best way to obtain basemap is via Anaconda/MinicondaFor Linux/Mac users just download the bash installer and inside the download directory, execute the following command: bash Miniconda*.sh

    For Windows user, download the exe file and install it. After the installation of Anaconda/Miniconda, basemap can be obtained by simply executing the command conda install basemap

Click here
to download the programs.


Program for extracting the earthquake informations


Often in seismology, we require the information about the earthquake such as location, origin time, depth, magnitude, etc. To obtain this information, users generally, browse the web or if he/she is experienced then just go to the predetermined sites such as global CMT catalog. Even, after reaching the correct website, we need to do some dig up to get the desired information. This program is meant to scrape the global CMT website to obtain the earthquake information.

How to obtain the program

You can download the package from here.

How to use the program

The package you download from the given site consists of two python programs. These programs require pre-installed python-3.

You can obtain python from the following websites:
1. Python software
2. The user can also download the python package anaconda

After installing the python 3, to run the program, type:
Then enter the event recognition parameter such as
And enter. It will prompt the event information.

Screen Shot 2017-08-27 at 10.07.05 PM

Using the program as a module

You can also use the package as a python module. For using it as a module, save the program in the current directory or in the computer-discoverable path.
import EQinfo_extractor as eq_ext

Using Git and Github to store our programs (tutorial)

Git & GitHub tutorial

Git is a version control system for tracking changes in computer files. It was initially created by Linus Torvalds (creator of Linux system) in 2005.
We can use it to store any kinds of programs. It is distributed version control system. It means that many developers can work on the same project without being on the same network. It tracks each of the changes made to the files in the project. The user can revert to any file at any time of it had been committed to the repository. We can even look the snapshots of the code at any particular time in history. We can upload (or push) the files to the remote repository.

Initializing local git repository

git init

Configuring the Username and Email

git config --global 'utpalkumar'
git config --global ''

We can have a look at the configuration using the following command

git config --list

If you need any help, you can just type:

git help

Now, to add the files to the index and working tree (our final aim)

we use git add command
git add

To add all the files in the current directory

git add .
git add *.html

We can get the information about the tracked and untracked files. Tracked files are those which have been added to the working directory.

git status

If we make any changes to the files, we can inspect the changes using the diff command

git diff

If we want to remove the file from the index (untrack the file), then we can simply type:

git rm --cached

To remove the files from the index and the working tree:

git rm filename.txt

If we want to rename the file then we can do it using the git command too. We don’t need to untrack the file and then rename the original file and add it again

git mv filname.txt newfilename.txt

Now, we can commit the files to add to our repository on the GitHub.

There are two ways of doing that:
1. First way opens the vi or default directory on the local computer, and the user is prompted to enter the message. It is safe to enter meaningful messages because it is useful to track the changes made to the file.
git commit
2. The user can also enter the message using the -m flag
git commit -m 'made some changes'

To make changes to the committed files, the command is –amend

git commit --amend

If we don’t want to include some files in the current directory into the index or working tree, we can add the name of those files in the .gitignore file.

touch .gitignore

We can obtain the log of the git actions

git log --pretty=oneline
We can make this better formatted
git log --pretty=format:"%h : %an : %ar : %s"
For all commits within a week
git log --since=1.weeks
For all commits since some given date
git log --since="2014-01-12"
All the commits of a given author
git log --author="utpalkumar"
All commits before a given date
git log --before="2014-04-30"

If we are a group of developers. We don’t wish to add any changes to the repository without finishing a particular sub-project. We can avoid that by working in a branch

To create a branch other the main branch (master)
git branch mybranch
To switch to the new branch
git checkout mybranch

We can make any change in this branch and merge these changes in the master branch, we need to first switch to the master

git checkout master

to merge the changes to the original file

git merge mybranch -m "added mybranch"

For adding files to the remote git repository, we need to make account on the Github website and then we need to start a new repository, name it, give some required details then, we can add files from our local repository to the remote repository using the following commands:

git remote add origin https://repositoryaddress.git
Replace the above fake URL with the URL you get from the repository you create on the Github website.
git push

If we want to add the same files to another repository, we need to remove the added remove origin using the command

git remote rm origin # to remove the remote origin

We can download the git directory by simply using the git clone command

git clone https://repositoryaddress.git

Solving nth order differential equation for a given boundary condition

We are going to solve the differential equation with the boundary conditions

\psi_{xx} +[100 - \beta ]\psi = 0

\implies \psi_{xx} = [\beta - 100] \psi

Let’s take the simpler boundary condition

\psi(x=\pm 1) =0.

Now, the first thing which we wanna do is to convert this equation in the first order form. After that, we can simply use the 4th order Runge-Kutta to obtain higher order solution.

To transform our 2nd order differential equation into first order, we define two variables. To transform nth order equations we need n such variables. This is the standard process to convert any differential equation into the system of first order differential equations.

Lets take

y_1 = \psi (1)

\implies y_1' = \psi_x

\implies y_1' = y_2

y_2= \psi_x (2) [derivative of \psi]

\implies  y_2' = \psi_{xx}

\implies  y_2' = [\beta - 100] \psi

So, we have transformed the 2nd order differential equation into two first order differential equations.

And our boundary condition will become y_1(\pm 1) =0

Now for the differential equations of the type

\psi_{xx} = [\beta - 100] \psi

\implies \psi_{xx} = a \psi

have the solution \psi = e^{\pm \sqrt{a}x} (for positive a). Now this solution doesn’t satisfy our condition \psi(x=\pm 1) =0. Exponential functions are monotonic and hence cannot be 0 at two points. So, this tells us that a is not positive, i.e., \beta \ngtr 100.

Now, if \beta < 100, then our eqn becomes

\psi_{xx} = - a \psi

The solution to this is of the form of cosines and sines, which has potential to satisfy the boundary condition. So, we have \beta < 100.

Defining Function in MATLAB:

function [ rhs ] = yfunc( x,ic,dummy,beta )

rhs = [y2; (beta -100)*y1];


Calculating Solutions:

clear; close all; clc
set(0, 'DefaultAxesFontSize', 14)

tol=10^(-4); % define a tolerance level to be achieved
% by the shooting algorithm

xspan=[-1 1]; %boundary conditions % define the span of the computational domain

A=1; % define the initial slope at x=-1
ic=[0 A]; % initial conditions: x1(-1)=0, x1?(-1)=A

n0=100; % define the parameter n0
beta_start=n0; % beginning value of beta
for jj=1:5 % begin mode loop
    beta=beta_start; % initial value of eigenvalue beta
    dbeta=n0/100; % default step size in beta
    for j=1:1000 % begin convergence loop for beta
        [t,y]=ode45('yfunc',xspan, ic,[],beta); % solve ODEs
       if y(end,1)*((-1)^(jj+1)) > 0 %this checks if the beta needs to be higher or lower for convergence
           beta = beta-dbeta;
           beta = beta + dbeta/2; %uses bisection to converge to the solution
       if abs(y(end,1)) < tol % check for convergence
           betasol=[betasol beta]; % write out eigenvalue
           break % get out of convergence loop
    beta_start=beta-0.1; % after finding eigenvalue, pick
    % new starting value for next mode
    norm=trapz(t,y(:,1).*y(:,1)); % calculate the normalization
    plot(t,y(:,1)/sqrt(norm)); hold on
legend(sprintf('\\beta_1 = %.4f',betasol(1)),sprintf('\\beta_2 = %.4f',betasol(2)),sprintf('\\beta_3 = %.4f',betasol(3)),...
    sprintf('\\beta_4 = %.4f',betasol(4)),sprintf('\\beta_5 = %.4f',betasol(5)));
xlabel('x'); ylabel('\psi (x)','FontWeight','bold')



Spectral Analysis of wide variety of data in MATLAB

Signals can be any time-varying or space-varying quantity. Examples: speech, temperature readings, seismic data, stock price fluctuations.

A signal has one or more frequency components in it and can be viewed from two different standpoints: time-domain and frequency domain. In general, signals are recorded in time-domain but analyzing signals in frequency domain makes the task easier. For example, differential and convolution operations in time domain become simple algebraic operation in the frequency domain.

Fourier Transform of a signal:

We can go between time-domain and frequency domain by using a tool called Fourier transform. 

Now get comfortable with Fourier transform, let’s take an example in MATLAB:


clear; close all; clc

%%Creating dataset
fs=100;    %sampling frequency (samples/sec)
f1=10;  %frequency1
f2=20;  %frequency2
f3=30;  %frequency3


We represent the signal by the variable x. It is the summation of three sinusoidal signals with different amplitude, frequency and phase shift. We plot the signal first.

%%Visualizing data
legend('Two-tone signal')


We then do the Fourier transform of the signal to obtain its magnitude and phase spectrum.

%%Fourier transform of the data
X_mag=abs(fftshift(X));   %magnitude
X_phase=angle(fftshift(X)); %phase

%%Frequency bins

%%Magnitude response
title('FFT - Magnitude response')
grid on

%%Phase response
xlabel 'Frequency (Hz)'
ylabel 'Phase / \pi'
title('FFT - Phase response')
grid on

Power spectrum estimation using the Welch’s method:

Now we compute the power spectrum of the using the Welch’s method. We can use the function “pwelch” in Matlab to obtain the desired result.

pwelch(x,[],[],[],fs) %one-sided power spectral density


Pwelch is a spectrum estimator. It computes an averaged squared magnitude of the Fourier transform of a signal. In short, it computes a set of smaller FFTs (using sliding windows), computes the magnitude square of each and averages them.

By default, $latex x$ is divided into the longest possible segments to obtain as close to but not exceed 8 segments with 50% overlap. Each segment is windowed with a Hamming window. The modified periodograms are averaged to obtain the PSD estimate. If you cannot divide the length of $latexx$  exactly into an integer number of segments with 50% overlap, $ latex x$ is truncated accordingly.

Note the peak at 10, 20 and 30 Hz. Also, note the display default for Pwelch is in dB (logarithmic).

Let us inspect another data using the “pwelch” function of the Matlab.

clear; close all; clc

load officetemp1; %16 weeks 2 samples/hr

%%Visualizing data
ylabel(sprintf('Temperature (%c C)', char(176)))
xlabel('Time (weeks)')
title('Temperature Readings')
grid on


tempnorm=tempC-mean(tempC); %temperature residual

fs=2*24*7; %2 samples per hour

axis([0 16 -30 0])


Resolving two close frequency components using “pwelch”:

Let’s first plot the data (having the frequency components at 500 and 550 Hz) using the default parameters of “pwelch” function:

clear; close all; clc
load winLen;


title('pwelch with default input- f1: 500Hz, f2: 550Hz')



Here, we can see that the frequency component at 500Hz can be resolved but the frequency component at 550 Hz is barely visible.

One can obtain better frequency resolution by increasing the window size:

filename = 'pwelchWindowAnalysis.gif';
for len=500:300:N
    title(sprintf('Hamming Window size: %d',len))
    frame = getframe(1);
      im = frame2im(frame);
      [imind,cm] = rgb2ind(im,256);
      if len == 500;
          imwrite(imind,cm,filename,'gif', 'Loopcount',inf);


Here, we can see that by increasing the window width, we can resolve the two components. By increasing the window width, we lose the temporal resolution but at the same time, we gain the spectral resolution.

Resolving the frequency component using the Kaiser window:

“pwelch” function uses Hamming window by default.

L = 64;


Because of the inherent property of the Hamming window (fixed side lobe), sometimes the signal can get masked.

clear; close all; clc

load winType;

title('pwelch with default input- f1: 500Hz, f2: 5kHz')


In the above figure, we can see that the frequency component of the signal at 5kHz is barely distinguishable.

To resolve this component of frequency, we use Kaiser window instead of the default Hamming window.


Kaiser Window in Frequency domain
Kaiser Window in time domain



% %% Kaiser window
h1=kaiser(len,4.53); %side lobe attenuation of 50dB
title('Kaiser window with 50dB side lobe attenuation')

% %% Kaiser window
h2=kaiser(len,12.26); %side lobe attenuation of 50dB
title('Kaiser window with 120dB side lobe attenuation')

To obtain a Kaiser window that designs an FIR filter with sidelobe attenuation of α dB, we use the following β :


Screen Shot 2017-05-14 at 2.07.32 PM.jpg



Amplitude of one frequency component is much lower than the other:

We can have some signal where the amplitude of one frequency component is much lower than the others and is inundated in noise.

To deal with such signals, we need to get rid of the noise using the averaging of the window.

clear; close all; clc

load winAvg; 

% %% Kaiser window
h1=kaiser(len,4.53); %side lobe attenuation of 50dB
title('Kaiser window with 50dB side lobe attenuation')


In the above signal plot, the second frequency component at 14kHz is undetectable. We can get rid of the noise using the averaging approach.

filename = 'pwelchAveraging.gif';
for len=2050:-100:550
    h2=kaiser(len,4.53); %side lobe attenuation of 50dB
    frame = getframe(1);
      im = frame2im(frame);
      [imind,cm] = rgb2ind(im,256);
      if len == 2050;
          imwrite(imind,cm,filename,'gif', 'Loopcount',inf);

Here, we take the smaller window in steps to show the effect of averaging. Smaller window in frequency domain is equivalent to the larger window in time domain.


Dealing with data having missing samples:

clear; close all; clc
%%Signals having missing samples
load weightData;
ylabel('Weight in kg'); xlabel('Day of the year')
grid on; axis tight; title('Weight readings of a person')


In case, we have missing samples in the data, i.e., the data is not regularly recorded, then we cannot apply the “pwelch” function of matlab to retrieve its frequency components. But thankfully, we have the function “plomb” which can be applied in such cases.

xlabel('Cycles per week');ylabel('Magnitude')
grid on; axis tight


Analyzing time and frequency domain simultaneously:

Sometimes, we need the time and frequency informations simultaneously. For a long series of data, we need to know which frequency component is recorded first and at what time. This can be done by making the spectrogram.

clear; close all; clc
load dtmf;



% %%
spectrogram(x,[],[],[],Fs,'yaxis'); colorbar; %default window is hamming window


Since, the time resolution is higher for smaller window and frequency resolution is lower at small window length. So there is the trade off between the time and frequency domain window length. We need to figure out the optimal point for better resolution in both time and frequency domain.

segLen = [120, 240,480,600,800,1000,1200,1600]; 

for itr = 1:numel(segLen)
    title(sprintf('window length: %d',segLen(itr)))
    frame = getframe(1);
      im = frame2im(frame);
      [imind,cm] = rgb2ind(im,256);
      if itr == 1;
          imwrite(imind,cm,filename,'gif', 'Loopcount',inf);


Calculating the curvature of a curve

Screenshot from 2017-05-03 16-46-57.png


The curvature of the curve is the amount by which it deviates from being a straight line. It is defined as the reciprocal of the radius of the best fitting circle. So, the curvature of a straight line is zero.  The curvature of a circle of radius R should be large if R is small and vice-versa.

Screenshot from 2017-05-03 16-41-38

Screenshot from 2017-05-03 16-42-03

Download pdf file, please click here.

Locating Earthquake using Geiger’s Method

Earthquake location problem is old, however, it is still quite relevant. The problem can be stated as to determine the hypocenter (x0,y0,z0) and origin time (t0) of the rupture of fault on the basis of arrival time data of P and S waves. Here, we have considered hypocenter in the cartesian coordinate system. It can be readily converted to geographical coordinates using appropriate conversion factor. The origin time is the start time of the fault rupture.

Here, our data are the P and S arrival times at the stations located at the surface (xi,yi,0). We have also assumed a point source and a constant velocity medium for the simplicity sake.

The arrival time of the P or S wave at the station is equal to the sum of the origin time of the earthquake and the travel time of the wave to the station.

The arrival time, ti=Ti + t0, where Ti is the travel time of the particular phase and t0 is the origin time of the earthquake.

Ti = sqrt((xi-x0)^2 + (yi-y0)^2 + z0^2)/v; where v is the P or S wave velocity.

Now, we have to determine the hypocenter location and the origin time of the Earthquake. This problem is invariably ill-posed as the data are  contaminated by noise. So, we can pose this problem as an over-determined case. One way to solve this inconsistent problem for the hypocenter and origin time is to minimize the least square error of the observed and predicted arrival times at the stations.

Before that, we need to formulate the problem. As evident from the above equation, the expression for arrival time and eventually the least square error function is nonlinear. This nonlinear inverse problem can be solved using the Newton’s method. We will utilize the derivative of the error function in the vicinity of the trial solution (initializing the problem with the initial wise guess) to devise a better solution. The next generation of the trial solution is obtained by expanding the error function in a Taylor series about the trial solution. This process continues until the error function becomes considerably small. Then we accept that  trial solution as the estimated model parameters.

d = G m, where d is the matrix containing the arrival time data, m is the matrix containing the model parameters and G is the data kernel which maps the data to the model.

There are some instances where the matrix can become underdetermined and the least square would fail such as when the data contains only the P wave arrival times. In such cases the solution can become nonunique. We solve this problem as a damped least square case. Though it can also be using the singular value decomposition which allows the easy identification of the underdetermined case and then partitioning the overdetermined and underdetermined in upper and lower part of the kernel matrix and dealing with both part separately.

m = [G’G + epsilon]G’d

The parameter epsilon is chosen by trial and error to yield a solution that has reasonably small prediction error.

In this problem, we deduce the gradient of the travel time by examining the geometry of the ray as it leaves the source. If the earthquake is moved a small distance s  parallel to the ray in the direction of the receiver, then the travel time is simply decreased by s/v, where v is the velocity of the phase. If the earthquake is moved a small distance perpendicular to the ray, then the change in travel time is negligible since the new path will have nearly same length as the old one.

T = r/v

If the earthquake is moved parallel to the ray path,

T” = (r-s)/v = r/v – s/v

delta T = T” – T = r/v – s/v – r/v  = – s/v

If the earthquake has moved perpendicular to the raypath,

delta T = 0

So, delta T = – s/v, where is a unit vector tangent to the ray at the source and points toward the receiver.

This is the Geiger’s Method, which is used to formulate the data kernel matrix.

Let us take a hypothetical earthquake location problem and solve it using this method.


clear; close all;
global G epsilon;

% Velocity parameters
vpvs = 1.78;

% defining the region of the problem which contains both the source and the receiver
%We take a 100x100 units^2 area and the depth of 100 units. The surface is 0 units depth.

% stations: x, y, z coordinates (xi, yi, 0)
sxm = [-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]'*ones(1,9)*10;
sym = 10*ones(9,1)*[-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9];
sx = sxm(:);
sy = sym(:);
Ns = length(sx); % num of stations
sz = zmax*ones(Ns,1); %zeros

% true earthquakes locations
Ne = 20; % number of earthquakes
M = 4*Ne; % 4 model parameters, x, y, z, and t0, per earthquake
extrue = random('uniform',xmin,xmax,Ne,1); %x0
eytrue = random('uniform',ymin,ymax,Ne,1); %y0
eztrue = random('uniform',zmin,zmax,Ne,1); %z0
t0true = random('uniform',0,0.2,Ne,1); %t0
mtrue = [extrue', eytrue', eztrue', t0true']'; %column matrix containing all the true model parameter values

Nr = Ne*Ns; % number of rays, that is, earthquake-stations pairs
N = 2*Ne*Ns; % total data; Ne*Ns for P and S each

% Generating the true data based on the true earthquake and station locations
dtrue=zeros(N,1); %allocating space for all data
for i = [1:Ns] % loop over stations
for j = [1:Ne] % loop over earthquakes
 dx = mtrue(j)-sx(i); % x component of the displacement obtained by the difference between station and source location x component
 dy = mtrue(Ne+j)-sy(i); % y component of the source- receiver displacement
 dz = mtrue(2*Ne+j)-sz(i); % z component of displacement between source and receiver
 r = sqrt( dx^2 + dy^2 + dz^2 ); % source-receiver distance
 dtrue(k)=r/vp+mtrue(3*Ne+j); %P arrival time for each station-source pair obtained by summing the travel time with the origin time of the earthquake
 dtrue(Nr+k)=r/vs+mtrue(3*Ne+j); % S arrival time
%  Generating observed data by adding gaussian noise with standard deviation 0.2 
sd = 0.2;
dobs=dtrue+random('normal',0,sd,N,1); % observed data

%% Determining the predicted arrival time using the Geiger's Method
% inital guess of earthquake locations
mest = [random('uniform',xmin,xmax,1,Ne), random('uniform',ymin,ymax,1,Ne), ...
 random('uniform',zmin+2,zmax-2,1,Ne), random('uniform',-0.1,0.1,1,Ne) ]';
% Here, we take a random initial guess

%Formulating the data kernel matrix and estimating the predicted models
for iter=[1:10] %for 10 iterations (termination criteria)
 % formulating data kernel
 G = spalloc(N,M,4*N); %N- total num of data,2*Ne*Ns; M is total num of model
 %parameters, 4*Ne
 dpre = zeros(N,1); %allocating space for predicted data matrix
 for i = 1:Ns % loop over stations
 for j = 1:Ne % loop over earthquakes
 dx = mest(j)-sx(i); % x- component of displacement obtained using the initial guess
 dy = mest(Ne+j)-sy(i); % y- component of displacement
 dz = mest(2*Ne+j)-sz(i); % z- component of displacement
 r = sqrt( dx^2 + dy^2 + dz^2 ); % source-receiver distance for each iteration
 k=(i-1)*Ne+j; %index for each ray
 dpre(k)=r/vp+mest(3*Ne+j); %predicted P wave arrival time
 dpre(Nr+k)=r/vs+mest(3*Ne+j); %predicted S wave arrival time
%First half of data kernel matrix correspoding to P wave
 G(k,j) = dx/(r*vp); % first column of data kernel matrix
 G(k,Ne+j) = dy/(r*vp); % second column of data kernel matrix
 G(k,2*Ne+j) = dz/(r*vp); % third column of data kernel matrix
 G(k,3*Ne+j) = 1; % fourth column of data kernel matrix
% Second half of the data kernel matrix corresponding to S wave
 G(Nr+k,j) = dx/(r*vs);
 G(Nr+k,Ne+j) = dy/(r*vs);
 G(Nr+k,2*Ne+j) = dz/(r*vs);
 G(Nr+k,3*Ne+j) = 1;
 % solve with dampled least squares
 dd = dobs-dpre;
 dm=bicg(@dlsfun,G'*dd,1e-5,3*M); solving using the biconjugate method
%solving the damped least square equation G'dd = [ G'G + epsilon* I] dm
% We use biconjugate method to reduce the computational cost (see for the dlsfun at the bottom)
 mest = mest+dm; %updated model parameter

% Generating the final predicted data
for i = 1:Ns % loop over stations
for j = 1:Ne % loop over earthquakes
 dx = mest(j)-sx(i);
 dy = mest(Ne+j)-sy(i);
 dz = mest(2*Ne+j)-sz(i);
 r = sqrt( dx^2 + dy^2 + dz^2 );
 dpre(k)=r/vp+mest(3*Ne+j); % S-wave arrival time
 dpre(Nr+k)=r/vs+mest(3*Ne+j); % P- wave arriavl time

% Calculating the data and model misfit
expre = mest(1:Ne); % x0
eypre = mest(Ne+1:2*Ne); %y0
ezpre = mest(2*Ne+1:3*Ne); %z0
t0pre = mest(3*Ne+1:4*Ne); %t0
dd = dobs-dpre; %residual of observed and predicted arrival time
E = dd'*dd; %error
fprintf('RMS traveltime error: %f\n', sqrt(E/N) );
Emx = (extrue-expre)'*(extrue-expre); %misfit for x0
Emy = (eytrue-eypre)'*(eytrue-eypre); %misfit for y0
Emz = (eztrue-ezpre)'*(eztrue-ezpre); %misfit for z0
Emt = (t0true-t0pre)'*(t0true-t0pre); %misfit for t0
fprintf('RMS model misfit: x %f y %f z %f t0 %f\n', sqrt(Emx/Ne), sqrt(Emy/Ne), sqrt(Emz/Ne), sqrt(Emt/Ne) );
Earthquake location Example (William Menke, 2012) : The black triangles are the station locations, green circles are true earthquake location and the blue circles are predicted earthquake location.


function y = dlsfun(v,transp_flag)
global G epsilon;
temp = G*v;
y = epsilon * v + G'*temp;