{ "cells": [ { "cell_type": "markdown", "id": "limiting-bleeding", "metadata": {}, "source": [ "# Bulk GRB product downloads from the UKSSDC\n", "\n", "V1.0. Phil Evans, UKSSDC, University of Leicester. 2021 April 30\n", "\n", "## Introduction\n", "\n", "I periodically get requests from people along the lines of, \"I want to download all of the light curves for GRBs with redshifts, is there an easy way to do it?\" And my answer is always a variant on \"Not yet, it's on my to do list\".\n", "\n", "This is something I intend to do 'properly', when other ongoing projects allow. I would like a system that lets you select GRB properties (either from the XRT catalogue provided by the UKSSDC at https://www.swift.ac.uk/xrt_live_cat; or from the Swift GRB data table provided by the SDC at https://swift.gsfc.nasa.gov/archive/grb_table), and then lets you select what products to download, and then gives the to you.\n", "\n", "This is non-trivial, not least providing a generic interface for filtering on any of the available data columns, some of which may be blank.\n", "\n", "However, it is possible for the user (that's you) to identify the GRBs you are interested in, and then to download the files yourself, *if* you knew how those files were organised at the UKSSDC. This information is not secret, it is, however, a little esoteric.\n", "\n", "The purpose of this notebook is to make it easy (OK, easier) for you as a user to carry out bulk downloads, until I have created a client-side interface for this.\n", "\n", "## Scope of this demonstration\n", "\n", "This notebook will download all available light curves and burst analyser data for the short GRBs. Here, the short GRBs are defined as \"Those with a T90 value in the SDC data table, that is <2 s\".\n", "\n", "**This demonstration is not intended for scientific use**\n", "\n", "It will quickly be seen that this is a totally arbitrary definition of short GRBs, is dependent on the accuracy of the SDC data table, and the way in which those entries are parsed. For proper scientific analysis, steps 2.5 and 3 should be replaced with careful data selection processes.\n", "\n", "## Alternative uses\n", "\n", "This approach selects GRBs based on the SDC data table contents (steps 2-3). You don't have to do this. If you have a list of GRB names, created however you like (including as the output of an entirely separate selection process), you can still run steps 1 and 4 (probably with small edits) to convert the GRB names into their indexed targetIDs, and thus URLs, and download the code.\n", "\n", "## Notes\n", "\n", "* Even if you just want to run this demo **please** read everything, including the Python code first and check you know what it's doing. I cannot be held responsible if you overwrite files (for example) because you didn't check where and how this notebook will save what it downloads.\n", "\n", "* I am still relatively new to Python, and have not fully grasped the most efficient and 'pythonic' ways to do things. In particular, the process of converting the T90 values to numbers (step 2.5) looks somewhat convoluted to me (I could do it in 1 line of Perl); please feel free to modify this, and indeed to send me any improvements you can make." 
{ "cell_type": "markdown", "id": "secret-crime", "metadata": {}, "source": [ "## Requirements\n", "This is a Python 3 notebook; it was built using Python 3.8 but should work with earlier versions of Python 3, although if you go before about 3.6 you'll have to replace the f-strings with older ways of embedding text.\n", "\n", "We are going to require some Python modules, so let's import them now." ] },
{ "cell_type": "code", "execution_count": null, "id": "intellectual-seafood", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import requests\n", "import io\n", "import re\n", "import math" ] },
{ "cell_type": "markdown", "id": "revolutionary-resort", "metadata": {}, "source": [ "## Step 1: Collect the name-targetID lookup\n", "\n", "GRB data at the UKSSDC are indexed by the primary targetID, rather than GRB name. Some GRBs have multiple targetIDs (e.g. things covered with tiling). It is expected that you will have (or create) a set of GRBs recorded by their names, so I provide an index online (https://www.swift.ac.uk/xrt_curves/grb.list) from name to the primary targetID.\n", "\n", "Here, we will download this and ingest it into a dict. The list should not contain duplicates, but it's always best to check I haven't made a mistake, rather than taking it for granted, so we will do so.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "active-manufacturer", "metadata": {}, "outputs": [], "source": [ "# Download the name -> targetID lookup table:\n", "r = requests.get('https://www.swift.ac.uk/xrt_curves/grb.list')\n", "if r.status_code != 200:\n", "    raise RuntimeError(\"Error downloading GRB look-up table from the UKSSDC\")\n", "\n", "lookup = {}\n", "\n", "for l in r.text.split('\\n'):\n", "    if l == \"\":\n", "        continue\n", "    (n, t) = l.split('\\t')\n", "    if n in lookup:\n", "        print(f\"Warning: `{n}` has multiple entries; only accepting the first ({lookup[n]}), skipping `{t}`\")\n", "    else:\n", "        lookup[n] = t\n", "        # print(f\"Saving {n} = {t}\")" ] },
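{ "cell_type": "markdown", "id": "gentle-sanity", "metadata": {}, "source": [ "An optional sanity check on the lookup (not needed for the rest of the notebook), just to show the dict in action: the keys include the 'GRB ' prefix, which matters when we build names in step 4. 'GRB 130427A' below is just an example name; `dict.get()` returns `None`, rather than raising an error, if the key is absent." ] },
{ "cell_type": "code", "execution_count": null, "id": "gentle-sanity-code", "metadata": {}, "outputs": [], "source": [ "# Optional: a quick look at the lookup dict. 'GRB 130427A' is just an example name.\n", "print(f\"The lookup contains {len(lookup)} GRBs\")\n", "print(f\"GRB 130427A -> {lookup.get('GRB 130427A')}\")" ] },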
{ "cell_type": "markdown", "id": "powerful-charleston", "metadata": {}, "source": [ "## Step 2: Load the SDC Swift data table\n", "\n", "For this demonstration we are going to use the SDC Swift data table as our source of information about the GRBs. This table appears to be designed mainly for human reading, not machine reading, so downloading and handling it is a bit involved. Luckily (for you), I've worked through that, as below.\n", "\n", "This is a multi-step process because the URL for the text file is only available from the 'fullview' page, and appears to be changeable.\n", "\n", "**Note** if the format of the SDC 'fullview' page changes, then this code will need adapting." ] },
{ "cell_type": "code", "execution_count": null, "id": "received-southwest", "metadata": {}, "outputs": [], "source": [ "# First, get the fullview web page:\n", "r = requests.get('https://swift.gsfc.nasa.gov/archive/grb_table/fullview/')\n", "if r.status_code != 200:\n", "    raise RuntimeError(\"Error downloading GRB data table from the SDC\")\n", "\n", "# This is HTML, and I want to find the line that links to the text file:\n", "allLines = r.text.split('\\n')\n", "tabLine = [l for l in allLines if 'tab-delimited text file' in l]\n", "\n", "# Now look in that line for the URL in the link; throw an error if there isn't one.\n", "match = re.search('a href=\"(.+)\">', tabLine[0])\n", "url = \"\"\n", "if match:\n", "    url = match.group(1)\n", "else:\n", "    raise RuntimeError(\"There is no text table linked. Bummer\")\n", "\n", "# OK, get the tsv. Note that the link starts with a /, so I need the base path.\n", "r = requests.get(\"https://swift.gsfc.nasa.gov\" + url)\n", "if r.status_code != 200:\n", "    raise RuntimeError(\"Error downloading GRB data table from the SDC (2)\")\n", "\n", "# Right, we have the data; now let's read it into a pandas DataFrame.\n", "# read_csv expects a file, but io.StringIO does some magic that tricks pandas into thinking that\n", "# my variable contents are in a file.\n", "SDC = pd.read_csv(io.StringIO(r.text), sep='\\t', header=0)" ] },
{ "cell_type": "markdown", "id": "fantastic-standard", "metadata": {}, "source": [ "### Step 2.5 Munge the data a bit\n", "\n", "We now have a `pandas` DataFrame containing the SDC table, but it's designed mainly for human reading, not computer reading, so we probably want to do some work to make it easier to use. First of all, let's just see what the column headers are:" ] },
{ "cell_type": "code", "execution_count": null, "id": "cordless-radar", "metadata": {}, "outputs": [], "source": [ "# Let's see what the column names are so we know how to index:\n", "SDC.columns" ] },
{ "cell_type": "markdown", "id": "colored-observation", "metadata": {}, "source": [ "Yikes; I don't fancy typing those into my code every few lines. Thankfully, we can rename them. In this demo we are only concerned with T90, so I'm just going to rename that column. To rename more columns, just add extra items to the dictionary supplied as the `columns` argument below, the format being: `oldName: newName`." ] },
{ "cell_type": "code", "execution_count": null, "id": "civil-bundle", "metadata": { "scrolled": true }, "outputs": [], "source": [ "SDC.rename(columns={'BAT T90 [sec]':'T90'}, inplace=True)" ] },
{ "cell_type": "markdown", "id": "loaded-convertible", "metadata": {}, "source": [ "It turns out that the T90 column is not always numeric: it includes values like \"~2.5\" which are not great for computers to read, and certainly not for filtering. So we need to do some manual work here to fix that column.\n", "\n", "**Really important note**\n", "\n", "> This is one of those points where, for proper scientific data selection, careful work is needed. \n", "> In this demonstration, I'm doing something very basic, with no justification at all\n", "\n", "In this example I'm going to reject any GRB with anything other than a number in its T90 column, and since I'll be selecting short GRBs (based on T90<2) I will simply set T90=1000000 for those:" ] },
{ "cell_type": "code", "execution_count": null, "id": "lyric-interim", "metadata": {}, "outputs": [], "source": [ "# First, create a function to do the filtering I want to do:\n", "\n", "def removeText(val):\n", "    if isinstance(val, str):\n", "        # Anything other than digits, signs and a decimal point means this is not a plain number:\n", "        if re.search(r'[^-+\\d.]', val):\n", "            val = 1000000.\n", "        else:\n", "            val = float(val)\n", "    elif math.isnan(val):\n", "        val = 1000000.\n", "    return val\n", "\n", "# Now apply it. I'm going to make a new column, in case I mess up, or want to still access the original.\n", "SDC['fixT90'] = SDC['T90'].apply(removeText)\n" ] },
{ "cell_type": "markdown", "id": "streaming-sport", "metadata": {}, "source": [ "## Step 3: Filter on the list we want\n", "\n", "Now we can do the filtering. Here, it's an easy filter on my new column. You may want to replace this with something more complex, or different, according to need.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "baking-diabetes", "metadata": {}, "outputs": [], "source": [ "# Apply some filter in pandas, e.g. T90<2s\n", "shortGRBs = SDC[SDC['fixT90'] < 2.0]\n", "\n", "# Report how many there are - useful for diagnostics; while developing this notebook this kept coming out at 0,\n", "# so the print is a helpful check that the filtering stages have worked.\n", "print(f\"There are {len(shortGRBs)} short GRBs\")\n", "\n", "# You may also want to view the table just to check that it looks sane; if so, uncomment this next line:\n", "# shortGRBs" ] },
{ "cell_type": "markdown", "id": "exciting-newsletter", "metadata": {}, "source": [ "## Step 4: Loop over GRBs and download data\n", "\n", "In this code, we're going to download the light curves and all the burst analyser data; for each of these products, we get the zip file containing all of the relevant files.\n", "\n", "The format of the URLs is straightforward:\n", "\n", "### Light curves\n", "* `https://www.swift.ac.uk/xrt_curves/$targetID/lcfiles_$targetID.zip`\n", "\n", "### Burst analyser\n", "* `https://www.swift.ac.uk/burst_analyser/$targetID/batxrtfiles_$targetID.zip`\n", "\n", "### Spectra\n", "* `https://www.swift.ac.uk/xrt_spectra/$targetID/interval0.tar.gz` (time averaged)\n", "* `https://www.swift.ac.uk/xrt_spectra/$targetID/late_time.tar.gz` (excluding first orbit)\n", "\n", "The targetID needs to be that under which the products are linked at the UKSSDC (some GRBs have multiple targetIDs), which is why we made the lookup dict at the start." ] },
{ "cell_type": "code", "execution_count": null, "id": "vulnerable-independence", "metadata": {}, "outputs": [], "source": [ "savePath = '/tmp'  # Edit this to where you want to save things!\n", "\n", "for index, row in shortGRBs.iterrows():\n", "\n", "    name = 'GRB ' + row['GRB']\n", "    print(f\"Trying to download {name}\")\n", "\n", "    if name in lookup:\n", "        targetID = lookup[name]\n", "        print(f\"  -> targetID = {targetID}\")\n", "        # Let's lose the spaces from the name, for the purposes of saving\n", "        outName = name.replace(\" \", \"\")\n", "\n", "        # Get the light curve\n", "        url = f\"https://www.swift.ac.uk/xrt_curves/{targetID}/lcfiles_{targetID}.zip\"\n", "        r = requests.get(url)\n", "        if r.status_code != 200:\n", "            print(f\"  WARNING: Cannot download light curve; probably none exists for {name}\")\n", "        else:\n", "            outfile = f\"{savePath}/{outName}_lc.zip\"\n", "            with open(outfile, \"wb\") as outF:\n", "                outF.write(r.content)\n", "            print(f\"  Light curve downloaded as {outfile}\")\n", "\n", "        # Get the burst analyser data\n", "        url = f\"https://www.swift.ac.uk/burst_analyser/{targetID}/batxrtfiles_{targetID}.zip\"\n", "        r = requests.get(url)\n", "        if r.status_code != 200:\n", "            print(f\"  WARNING: Cannot download burst analyser data; probably none exists for {name}\")\n", "        else:\n", "            outfile = f\"{savePath}/{outName}_burstAn.zip\"\n", "            with open(outfile, \"wb\") as outF:\n", "                outF.write(r.content)\n", "            print(f\"  Burst analyser data downloaded as {outfile}\")\n", "    else:\n", "        print(f\"  {name} is not in the list of GRBs with UKSSDC data products\")\n", "\n", "print(\"\\n\\n ** OK, downloads complete **\")" ] },
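{ "cell_type": "markdown", "id": "optional-spectra", "metadata": {}, "source": [ "The spectra were listed among the URL formats above, but the loop in step 4 does not fetch them. Here is a minimal sketch of how you could do so, following exactly the same pattern; it assumes `shortGRBs`, `lookup` and `savePath` are as defined above, and only fetches the time-averaged `interval0.tar.gz` (the `late_time.tar.gz` URL works the same way)." ] },
{ "cell_type": "code", "execution_count": null, "id": "optional-spectra-code", "metadata": {}, "outputs": [], "source": [ "# Optional: fetch the time-averaged spectra too, using the spectrum URL pattern given above.\n", "# This reuses shortGRBs, lookup and savePath from the earlier cells.\n", "for index, row in shortGRBs.iterrows():\n", "    name = 'GRB ' + row['GRB']\n", "    if name not in lookup:\n", "        continue\n", "    targetID = lookup[name]\n", "    outName = name.replace(\" \", \"\")\n", "    url = f\"https://www.swift.ac.uk/xrt_spectra/{targetID}/interval0.tar.gz\"\n", "    r = requests.get(url)\n", "    if r.status_code != 200:\n", "        print(f\"  WARNING: Cannot download spectrum; probably none exists for {name}\")\n", "    else:\n", "        outfile = f\"{savePath}/{outName}_spec.tar.gz\"\n", "        with open(outfile, \"wb\") as outF:\n", "            outF.write(r.content)\n", "        print(f\"  Spectrum downloaded as {outfile}\")" ] }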
], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }