Friday, August 14, 2015

Bash: Nasty data reformat

I received a dataset composed of several files named like this:

 A20C3M1_precip_1930_Europe.ascii
 A20C3M1_precip_1931_Europe.ascii
 A20C3M1_precip_1932_Europe.ascii
 A20C3M1_precip_1933_Europe.ascii
 A20C3M1_precip_1934_Europe.ascii
 A20C3M1_precip_1935_Europe.ascii
 ...

Each file had data like:

  36.761, -15.000,  21.732,  12.170,   7.797,  23.551,  12.186,  19.196,  30.779,  27.348,  32.999,  31.810,  16.109,  17.105
  39.296, -15.000,  84.459,   3.269,   5.787,  19.614,  19.731,  11.071,  25.962,  20.152,  20.350,  19.952,   9.661,  16.531
  41.831, -15.000,  99.225,   0.090,   0.000,   9.166,  18.276,   0.000,   0.000,   0.000,   6.851,  31.107,   3.937,  19.640
  44.366, -15.000,  64.088,   0.000,   0.000,   2.215,   7.657,   0.410,   0.000,   0.000,   6.258,  35.478,   0.388,  15.106
  ...
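
As a quick illustration of this layout, one row can be split into its 14 comma-separated columns with plain shell word splitting. This is only a sketch: the row is copied from the sample above, the variable names are mine, and it reads the first two columns as the station coordinates and the other twelve as monthly values.

```shell
#!/bin/bash
# Split one data row into latitude, longitude and twelve monthly values.
row='  36.761, -15.000,  21.732,  12.170,   7.797,  23.551,  12.186,  19.196,  30.779,  27.348,  32.999,  31.810,  16.109,  17.105'

old_ifs=$IFS
IFS=', '        # split on commas and blanks, so the padding disappears
set -f          # no filename expansion while splitting
set -- $row     # $1..$14 now hold the 14 columns
set +f
IFS=$old_ifs

lat=$1
lon=$2
shift 2         # $1..$12 are now the monthly values, January..December
n_months=$#
jan=$1
dec=${12}

echo "station: [$lat,$lon]"
echo "months:  $n_months"
echo "January: $jan  December: $dec"
```

Splitting on both commas and blanks avoids having to trim the padding from each field afterwards.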

So I had a time series with one file per year. Inside each file, the first two columns hold a pair of coordinates (lat, lon) identifying a "weather station", and columns 3 to 14 hold the precipitation value for each month of that year. My mission was to reformat this information into one file per station, with the values in this layout:

 STATION_[36.761,0.000].dat
 1930  1 3676
 1930  2 0
 1930  3 6439
 1930  4 7340
 1930  5 5179
 1930  6 5577
 1930  7 6001
 1930  8 5937
 1930  9 7331
 1930  10 3211
 1930  11 2632
 1930  12 9107
 1931  1 3676
 1931  2 0
 1931  3 11038
 1931  4 3409
 1931  5 6754
 1931  6 3821
 1931  7 2231
 1931  8 7168
 1931  9 6209
 1931  10 10026
 1931  11 8913
 1931  12 14465
 1932  1 3676
 1932  2 0
 1932  3 9029
 ...

This is the input format for the SPI program. Since I had to run the program on the provided data, the entire dataset had to be reformatted. This bash script does the job (perhaps not very time-efficiently, but effectively!!):

#!/bin/bash
##########################################################################
## Reformats the precipitation dataset to SPI input format
## Manuel Arturo Izquierdo  (c) 2011
## Run this script inside the directory with the original ascii files
##########################################################################


mkdir -p SPI_INPUT   # -p: no error if the directory already exists
header=1

# Use A20* for the A20C3M1 dataset
#for dataf in A20*  
# Use HLG* for the HLGM15B dataset
for dataf in HLG*
do
    year=`echo $dataf | cut -d'_' -f4` #use -f4 for HLG* , -f3 for A20*
    year=`echo $year+1700 |bc` #  HLG* only

{ 

    echo "Year: $year"

    while : 
    do 
    read a 
    if test "$a" == ""
    then
       break
    fi
    lat=`echo $a | cut -d',' -f1`   # column 1: latitude
    lon=`echo $a | cut -d',' -f2`   # column 2: longitude
    coord=`echo $lat$lon|cut -d' ' -f1`,`echo $lat$lon|cut -d' ' -f2`
    station='STATION_['$coord'].dat'
    if test "$header" -eq "1"
    then
       # the first line of each station file is its own name
       echo "$station" >> "SPI_INPUT/$station"
    fi
  
    for m in `seq 3 14`
    do
      let month=$m-2   # columns 3..14 hold months 1..12
      aridity=`echo $a | cut -d',' -f$m`
      aridity=`echo "(($aridity*100)+0.5)/1"| bc` # value x 100, rounded to an integer
      echo "$year  $month $aridity" >> "SPI_INPUT/$station"
   
    done
  
    done 
} < $dataf

header=`echo $header+1|bc`

done
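
The loop above starts several processes (echo, cut, bc) for every single value, which is where most of the time goes. A possible faster variant is one awk pass per yearly file. This is a minimal sketch, not the script I actually ran: reformat_year and its three arguments are hypothetical names, and it assumes A20-style file names with the year in the third underscore-separated field.

```shell
#!/bin/bash
# Sketch: one awk pass per yearly file.
# Hypothetical helper: reformat_year <file> <outdir> <write_header 0|1>
reformat_year() {
    local file=$1 outdir=$2 header=$3 year
    # Assumes names like A20C3M1_precip_1930_Europe.ascii (year in field 3)
    year=$(basename "$file" | cut -d'_' -f3)
    mkdir -p "$outdir"
    awk -F',' -v year="$year" -v outdir="$outdir" -v header="$header" '
        NF < 14 { next }                  # skip blank or short lines
        {
            lat = $1; gsub(/[ \t]/, "", lat)   # drop the padding
            lon = $2; gsub(/[ \t]/, "", lon)
            name = "STATION_[" lat "," lon "].dat"
            out = outdir "/" name
            if (header == 1)
                print name >> out         # first line: the station name
            for (m = 3; m <= 14; m++)     # columns 3..14 are months 1..12
                printf "%d  %d %d\n", year, m - 2, int($m * 100 + 0.5) >> out
            close(out)                    # avoid running out of open files
        }' "$file"
}
```

It would be called once per file, in year order, with the header flag set to 1 only for the first file, for example: first=1; for f in A20*; do reformat_year "$f" SPI_INPUT "$first"; first=0; done. One caveat: awk rounds with binary floating point, so a value whose third decimal is exactly 5 may come out one unit lower than with bc.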


The script generates one file STATION_[lat,lon].dat per station, in the required format. I found the trick for converting from float to int with bc at http://www.alecjacobson.com/weblog/?p=256. Et voilà!!
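
The rounding trick can be looked at on its own. A small sketch, where to_hundredths is a hypothetical helper name that does not appear in the script: bc keeps the decimals through the multiplication and the addition, but with its default scale of 0 the final division by 1 drops the fractional part, so adding 0.5 first gives rounding to the nearest integer.

```shell
#!/bin/bash
# to_hundredths is a hypothetical name; the expression is the one used
# in the script above.
to_hundredths() {
    # * and + keep the decimals; /1 with the default scale=0 truncates.
    echo "(($1*100)+0.5)/1" | bc
}

# Example calls:
#   to_hundredths 21.732    # 2173.2 + 0.5 = 2173.7, truncated to 2173
#   to_hundredths 12.186    # 1218.6 + 0.5 = 1219.1, truncated to 1219
```

This only rounds correctly for non-negative values, which is fine for precipitation: bc truncates toward zero, so a negative input would end up off by one.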
