Preparing the NCDC Weather Data for Hadoop

I’m exploring Hadoop with the book Hadoop: The Definitive Guide. Appendix A shows how to download the NCDC weather data from S3 and put it into Hadoop. I didn’t want to download from S3 or load the entire dataset, so here’s what I did instead.

Here’s a little bash script I used to download the data. You might want to do this if you want more up-to-date data, or if you only want to work with a subset. If you only want data for a certain year, just append that year to the URL in $source_url.
#!/bin/bash 
 
source_url="ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/"; 
download_to="$HOME/ncdc_data";  # ~ doesn't expand inside quotes, so use $HOME 
 
if [ ! -d "$download_to" ]; then 
    mkdir "$download_to"; 
fi 
 
wget -r -c --progress=bar --no-parent -P "$download_to" "$source_url";
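As a sketch of the per-year variant: the snippet below just builds the year-specific URL and leaves the actual wget call commented out. It assumes the FTP server keeps each year in its own subdirectory (e.g. .../noaa/2012/), which matches the paths used later in this post, but verify before relying on it.

```shell
#!/usr/bin/env bash

# Build a URL for a single year by appending it to the base URL.
year="2012"
source_url="ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/"
year_url="${source_url}${year}/"

echo "$year_url"
# wget -r -c --progress=bar --no-parent -P "$HOME/ncdc_data" "$year_url"
```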

I’ve modified the script from the Hadoop book to work with local files. I’m just working with the files from 2012; modify the path in $target if you want something different.

#!/usr/bin/env bash 
 
# NCDC Weather file to load into hadoop 
target="/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012"; 
 
# Un-gzip each station file and concat into one file 
echo "reporter:status:Un-gzipping $target" >&2 
for file in "$target"/*; do 
    gunzip -c "$file" >> "$target.all" 
    echo "reporter:status:Processed $file" >&2 
done 
 
# Put gzipped version into HDFS 
echo "reporter:status:Gzipping $target and putting in HDFS" >&2 
gzip -c "$target.all" | "$HADOOP_INSTALL/bin/hadoop" fs -put - "gz/$target.gz"
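The un-gzip-and-concatenate loop is easy to dry-run on throwaway files before pointing it at the real dataset. This sketch assumes only that gzip and mktemp are available; the "station" contents are made up for illustration.

```shell
#!/usr/bin/env bash

# Create two tiny gzipped "station" files in a temp directory.
tmp=$(mktemp -d)
printf 'station-a\n' | gzip > "$tmp/a.gz"
printf 'station-b\n' | gzip > "$tmp/b.gz"

# Same pattern as the script above: un-gzip each file and append to one file.
for file in "$tmp"/*.gz; do
    gunzip -c "$file" >> "$tmp/all"
done

combined=$(cat "$tmp/all")
echo "$combined"   # station-a, then station-b, on separate lines
rm -r "$tmp"
```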

The script will unzip all the files and combine them; you should see output similar to this:

reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-94996-2012.gz 
reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-96404-2012.gz 
reporter:status:Processed /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012/999999-99999-2012.gz

When it’s finished combining all the files, it will store the data in Hadoop.

reporter:status:Gzipping /home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012 and putting in HDFS 
13/01/11 21:37:52 INFO util.NativeCodeLoader: Loaded the native-hadoop library

Once the process has completed, you should be able to confirm that your data is stored in Hadoop with the following command:

 rhys@linux-g1rx:~/hadoop_scripts> hadoop fs -ls gz/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012.gz
 Found 1 items 
-rwxrwxrwx 1 rhys users 4870924294 2013-01-11 23:11 /home/rhys/hadoop_scripts/gz/home/rhys/ncdc_data/ftp3.ncdc.noaa.gov/pub/data/noaa/2012.gz

Now that I have data in Hadoop, it’s time to start writing MapReduce jobs!

