A few years ago, I needed to find as much LAS data for wells in Texas as I could. There isn’t much public data available for wells in Texas, but the University of Texas, endowed with a significant land grant, does provide free public access to a lot of wireline log data for wells on their lands. The catch is that it is only accessible via FTP.
Now, I love using Python for everything as much as the next guy, but sometimes Python is not the right tool for the job. And when it comes to scraping an FTP server, Python is definitely not the right tool for the job.
On the University of Texas Lands data FTP site, for example, there are raster images and LAS log data divided across thousands of subdirectories. Because each well has its own directory, nested in a county directory, it becomes extremely tedious to navigate the directory tree and search for files of a specific format. Traversing each subdirectory using Python proved to be error-prone, unstable, and extremely time-consuming. Luckily, there is a better way.
Enter LFTP, a GPL-licensed command-line file transfer program from the 1990s (still maintained and updated, God bless them!). After installing the program with your package manager, the problem of scraping all LAS files from an FTP server reduces to literally a single command(!).
lftp -c "mirror -i '\.[lL][aA][sS]$' -P 8 ftp://publicftp.utlands.utsystem.edu/ScannedLogs/"
That’s it. With that command you will feel god-like powers as thy computer does thy will.
A little explanation for this command is in order:
- lftp invokes the program; issuing that alone will launch an interactive prompt, but using the -c flag will cause LFTP to run the command that you give in the following double quotes and automatically exit when it has completed.
- Within the double quotes, the mirror command will copy everything from the remote server and store it locally (in whichever folder I am currently in).
- The -i flag passed to mirror will make LFTP only copy files matching the given regular expression. In this case, I want all the LAS files and only LAS files, so for my regular expression I use '\.[lL][aA][sS]$', which will match filenames that end with “.las” (case-insensitive).
- The -P 8 flag means that I am asking LFTP to download files in parallel; here I specify 8 transfers at a time. This speeds things up a bit.
- Lastly, I provide the target FTP site, in this case ftp://publicftp.utlands.utsystem.edu/ScannedLogs/
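Before kicking off a multi-hour transfer, it can be worth checking the include pattern on its own. As far as I can tell from the lftp man page, the -i pattern is an extended regular expression of the kind egrep accepts, so grep -E should be a reasonable stand-in. Here is a minimal sketch; the filenames are made up for illustration and are not taken from the UT Lands server.

```shell
#!/bin/sh
# Hypothetical filenames, invented for this check.
names='4238900001.las
4238900002.LAS
4238900003.Las
4238900004.tif
notes.las.txt
atlas'

# Keep only the names that the include pattern would select.
# grep -E is assumed here to behave like lftp's regex matching.
matched=$(printf '%s\n' "$names" | grep -E '\.[lL][aA][sS]$')

printf '%s\n' "$matched"
```

Only the first three names should be printed: the pattern requires a literal dot followed by "las" in any mix of upper and lower case at the very end of the name, so the .tif file, notes.las.txt, and atlas are all skipped.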
If you run this command, in a matter of a few short hours, you will have on your local machine all the LAS files that are available from UT-Lands (when I last ran this a year or two ago it was 30+ GB of data, which included numerous specialty logs like NMR, dielectric, neutron spectroscopy, etc., as well as loads of triple combos). If you end up using this code to retrieve LAS files, I’d really appreciate a shout out!
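Since the full run takes hours, a variant that can be previewed and resumed may be handy. The following is a sketch under some assumptions rather than the command I originally ran: the local directory name and the DRY_RUN and RUN switches are invented for this example, and I am relying on the man page's description of mirror's --continue option (resume an interrupted mirror) and --dry-run option (print what would be done instead of doing it). By default the script only prints the command it would hand to LFTP.

```shell
#!/bin/sh
# Sketch only: LOCAL_DIR, DRY_RUN and RUN are assumptions made for this
# example, not part of the original one-liner.
REMOTE='ftp://publicftp.utlands.utsystem.edu/ScannedLogs/'
LOCAL_DIR='./ut_lands_las'   # hypothetical local target directory
DRY_RUN="${DRY_RUN:-1}"      # 1 = ask mirror to print its plan only

# --continue resumes a partially finished mirror job.
opts='--continue'
if [ "$DRY_RUN" = "1" ]; then
    # --dry-run makes mirror print the transfers instead of running them.
    opts="$opts --dry-run"
fi

# Same include pattern and parallelism as the one-liner, plus a target dir.
cmd="mirror $opts -i '\.[lL][aA][sS]\$' -P 8 $REMOTE $LOCAL_DIR"
printf '%s\n' "$cmd"

# Contact the server only when explicitly asked to.
if [ "${RUN:-0}" = "1" ]; then
    lftp -c "$cmd"
fi
```

Saved as, say, fetch_las.sh (a hypothetical name), running it with RUN=1 would preview the transfer, and RUN=1 DRY_RUN=0 would start or resume the real download into the local directory.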
LFTP can do a lot of wonderful things, and this command just scratches the surface, so please take a look at the man pages and see what magic you can do with it.
P.S. please remember to scrape data respectfully and responsibly.