How to Politely Download All English Language Text Format Files from Project Gutenberg

November 1st, 2014 Permalink

There are plenty of projects that can make good use of a large assembly of written works in a specific language: training neural networks, testing natural language processing, using Markov chains to generate content, and so on. Project Gutenberg is a repository of some 46,000 texts in various languages that are unencumbered by copyright, so their use in any project should be safe enough from the intellectual property annoyances that plague the world these days. The idea that anyone can actually own an arrangement of data is both ridiculous and outrageous, but those who can successfully use the apparatus of government for the purposes of rent seeking on the back of this assertion are unlikely to stop any time soon. So it is good that there exist bodies of work that allow us to ignore this horrible state of affairs and just get on with producing better arrangements of data.

How to get hold of the Project Gutenberg files, however? There are torrents of DVD images, but these are years behind the times and contain duplicate copies of the works in various formats, so they are much larger than you might need. Still, this is perhaps good enough: set the torrent download running and come back when it is done. The complete set of up-to-date files is available online at the Project Gutenberg website, however, and so could be downloaded from there. This allows you to pick and choose the formats and files you want, such as restricting yourself to plain text rather than all of the various ebook formats.

If you are going to do this, the important thing to remember is that Project Gutenberg is a non-profit, volunteer organization, and it is intensely impolite to hammer their website in the process of downloading tens of thousands of files. If you are impatient, just download the DVD torrent. The Project Gutenberg website provides some tools to make it easy to obtain lists of files for automated downloading, and in return we should be polite and download files at an acceptably sedate rate. The most useful of these tools is the harvest page, which allows easy spidering of a restricted subset of the collection of books.

A harvest page such as http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en&offset=100 provides a paged list of zip file URLs for a specific language and format - plain text English in this case. Notice the "Next Page" link at the very end: a suitably configured spider will pick up on that and so work its way through the entire collection. Fortunately you don't even have to write such a spider, as the standard issue wget command has a recursive mirror mode that does all of this for you.
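
If you want a quick look at what one of these harvest pages contains before setting the full harvest running, something along the lines of the following one-liner will print the zip links from a single page. This is just an illustrative sketch using the URL above; the grep pattern is a rough guess at the shape of the link URLs rather than anything official.

wget -q -O - "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en&offset=100" | \
  grep -o 'http://[a-zA-Z0-9./_-]*\.zip'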

Here is a script that downloads and unzips all of the Project Gutenberg plain text English works at an acceptably slow rate. It can be stopped and will pick up again from where it was interrupted. As of late 2014 this generates about 30G of files in total: roughly 7G of zipfiles that unzip to about 21G of text. Expect it to take 36 to 48 hours to run on a fast connection, as it pauses between downloads and doesn't try to download multiple files at once. Still, you only have to download these files once.

#!/bin/bash
#
# Download the complete archive of text format files from Project Gutenberg.
#
# Estimated size in Q2 2014: 7G in zipfiles which unzip to about 21G in text
# files. So have 30G spare if you run this.
#
# Note that as written here this is a 36 to 48 hour process on a fast
# connection, with pauses between downloads. This minimizes impact on the
# Project Gutenberg servers.
#
# You'll only have to do this once, however, and this script will pick up from
# where it left off if it fails or is stopped.
#

# ------------------------------------------------------------------------
# Preliminaries
# ------------------------------------------------------------------------

set -o nounset
set -o errexit

# Restrict downloads to this file format.
FORMAT=txt
# Restrict downloads to this language. Deliberately not named LANG, so as not
# to clobber the shell's locale environment variable.
LANG_CODE=en

# The directory in which this file exists.
DIR="$( cd "$( dirname "$0" )" && pwd)"
# File containing the list of zipfile URLs.
ZIP_LIST="${DIR}/zipfileLinks.txt"
# A subdirectory in which to store the zipfiles.
ZIP_DIR="${DIR}/zipfiles"
# A directory in which to store the unzipped files.
UNZIP_DIR="${DIR}/files"

mkdir -p "${ZIP_DIR}"
mkdir -p "${UNZIP_DIR}"

# ------------------------------------------------------------------------
# Obtain URLs to download.
# ------------------------------------------------------------------------

# This step downloads ~700 html files containing ~38,000 zip file links. This
# will take about 30 minutes.

echo "-------------------------------------------------------------------------"
echo "Harvesting zipfile URLs for format [${FORMAT}] in language [${LANG}]."
echo "-------------------------------------------------------------------------"

# Only do this if it hasn't been done already.
if [ ! -f "${ZIP_LIST}" ] ; then
  # The --mirror mode of wget spiders through the linked harvest pages. The two
  # second delay is to play nice and not get banned, and the directory prefix
  # ensures the pages are saved next to this script no matter where it is run
  # from.
  wget \
    --wait=2 \
    --mirror \
    --directory-prefix="${DIR}" \
    "http://www.gutenberg.org/robot/harvest?filetypes[]=${FORMAT}&langs[]=${LANG_CODE}"

  # Process the downloaded HTML link lists into a single sorted file of zipfile
  # URLs, one per line. The pattern allows hyphens and underscores so that
  # variants such as 12345-8.zip are not missed.
  grep -oh 'http://[a-zA-Z0-9./_-]*\.zip' "${DIR}/www.gutenberg.org/robot/harvest"* | \
    sort | \
    uniq > "${ZIP_LIST}"

  # Get rid of the downloaded harvest files now that we have what we want.
  rm -Rf "${DIR}/www.gutenberg.org"
else
  echo "${ZIP_LIST} already exists. Skipping harvest."
fi

# ------------------------------------------------------------------------
# Download the zipfiles.
# ------------------------------------------------------------------------

# This will take a while: 36 to 48 hours. Just let it run. Project Gutenberg is
# a non-profit with a noble goal, so don't crush their servers, and it isn't as
# though you'll need to do this more than once.

echo "-------------------------------------------------------------------------"
echo "Downloading zipfiles."
echo "This will take 36-48 hours if starting from scratch."
echo "-------------------------------------------------------------------------"

# Read the URL list line by line rather than word-splitting it into memory.
while read -r URL
do
  ZIP_FILE="${ZIP_DIR}/${URL##*/}"
  # Only download it if it hasn't already been downloaded in a past run.
  if [ ! -f "${ZIP_FILE}" ] ; then
    wget --directory-prefix="${ZIP_DIR}" "${URL}"
    # Play nice with a delay.
    sleep 2
  else
    echo "${ZIP_FILE##*/} already exists. Skipping download."
  fi
done < "${ZIP_LIST}"

# ------------------------------------------------------------------------
# Unzip the zipfiles.
# ------------------------------------------------------------------------

echo "-------------------------------------------------------------------------"
echo "Unzipping files."
echo "-------------------------------------------------------------------------"

# The zipfiles all sit directly in ZIP_DIR, so a glob is enough here, and it
# copes with paths that contain spaces.
for ZIP_FILE in "${ZIP_DIR}"/*.zip
do
  UNZIP_FILE="$(basename "${ZIP_FILE}" .zip)"
  UNZIP_FILE="${UNZIP_DIR}/${UNZIP_FILE}.txt"
  # Only unzip if not already unzipped. This check assumes that x.zip unzips to
  # x.txt, which so far seems to be the case.
  if [ ! -f "${UNZIP_FILE}" ] ; then
    unzip -o "${ZIP_FILE}" -d "${UNZIP_DIR}"
  else
    echo "${ZIP_FILE##*/} already unzipped. Skipping."
  fi
done
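
Since the script runs for a day or two, it is worth launching it in a way that survives a closed terminal or dropped SSH session. A minimal way to do that is shown below; the filename is just an example, so substitute whatever you saved the script as, and screen or tmux work just as well as nohup. If the run dies partway through, simply start it again and it will skip everything already downloaded.

chmod +x gutenberg-download.sh
# nohup keeps the script running after the terminal closes; output is
# appended to nohup.out by default.
nohup ./gutenberg-download.sh &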