
I've recently acquired an interest in GIS and don't know of much support for drawing maps/boundaries/locations within LaTeX. pst-geo provides some mapping features, specifically at the PostScript level. I'm interested in creating something more open/available in the form of boundary files that have easily accessible latitude/longitude coordinates for the boundaries/shapes, similar to what KML files provide. However, I can see how this could easily blow out of proportion when considering the entire globe (jurisdictions within jurisdictions within jurisdictions, ...).

As time goes by, more and higher-detail data would probably become available, increasing the size of the data sets.

I'm looking for answers to the following:

  • What is the best way to tackle this, specifically in terms of its location on CTAN?
  • I think it would be unreasonable to require installation of the entire data set on a user's computer. How would one have users install large data sets in a piecemeal fashion?
  • Would all of it be hosted on CTAN, or do I need to host the large "external" data sets on a server of my own?

Here are my thoughts on this:

  • Create some base-level package, say gis-maps;
  • Allow users to load modules, perhaps specific to a country, using

    \usepackage[italy,south-africa,canada]{gis-maps}
    

    or

    \usepackage{gis-maps}
    \gissetup{maps={italy,south-africa,canada},...}
    

    that would load a list of helper macros specific to those jurisdictions. For example, based on country codes, the above might create something like \drawITA, \drawZAF and \drawCAN (amongst a host of other macros, perhaps based on some geographical hierarchy). A rough sketch of how this might be wired up follows this list.

  • The above modules would also load the coordinates of the boundaries.
  • The base package and modules might be big, but still manageable. However, the data sets themselves would be very large. So one would include instructions on how to add these to a TeX distribution manually, perhaps to a location like texmf-local. I don't know how this would work...
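
To make this a bit more concrete, here is a purely hypothetical sketch of how the module keys might be wired up with pgfkeys. The package code, the .def file layout, \drawZAF and the coordinates in the comments are all invented for illustration; none of this exists yet.

    % Purely hypothetical sketch of gis-maps.sty (names and layout invented).
    \NeedsTeXFormat{LaTeX2e}
    \ProvidesPackage{gis-maps}[2013/12/24 v0.0 GIS boundary drawing (sketch)]
    \RequirePackage{tikz}% for the drawing macros provided by the modules

    % \gissetup{maps={italy,south-africa,canada}} inputs one module per entry,
    % e.g. gis-maps-south-africa.def, which the user installed separately
    % (say, under texmf-local) because of its size.
    \newcommand*{\gissetup}[1]{\pgfqkeys{/gis}{#1}}
    \pgfkeys{
      /gis/maps/.code={%
        \@for\gis@module:=#1\do{%
          \InputIfFileExists{gis-maps-\gis@module.def}{}%
            {\PackageWarning{gis-maps}{Module `\gis@module' not installed}}%
        }%
      },
    }

    % A module would then provide its boundary data and a drawing macro
    % (for use inside a tikzpicture), e.g.
    %   \newcommand*{\drawZAF}{\draw (16.5,-28.6) -- (32.9,-26.9)
    %     -- (31.3,-22.4) -- cycle;}% dummy coordinates, not real boundary data
    \endinput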

Stay calm, I won't be including any treasure maps...

Werner
  • Just to clarify: The ultimate goal of what you are doing is that (possibly with extra downloads) a TeX user would have macros available for things like "draw me a map of Africa" or "draw me a map of São Paulo." Is this correct? – Charles Staats Dec 24 '13 at 17:25
  • @CharlesStaats: Yes, including even more detail. So, "draw me a map of Africa" is a continent-level drawing. However, one may also be interested to "draw a map that includes all the countries within Africa"; or "draw a map with all the provinces of South Africa", ... The idea is to build a geographical hierarchy so that one can load a specific jurisdiction (say, a country) based on codes, but in a natural way also load (selectively) jurisdictions within that code (say, provinces, territories, municipalities, ...). Regardless, lat/long coordinates for each jurisdiction need to be available somewhere. – Werner Dec 24 '13 at 17:41
  • What I can think of is a script similar to getnonfreefonts so one can download maps on demand. – egreg Dec 24 '13 at 20:39
  • i can't see the ctan team agreeing to hold any more than a few outline maps. gis dbs are big deals, and a multi-scale thing (as i read your post) could easily end up very large indeed and overwhelm the ctan structure. (in retrospect, the databases with the pst-geo* packages should never have been allowed...) – wasteofspace Dec 24 '13 at 21:52
  • As a matter of interest, wouldn't it be possible to write the package with whatever information is necessary in order to process the datasets (the country/district codes), while keeping the maps on a separate hosting service? Another question is: how do you intend to maintain the whole project? (Obviously, different people will have to input maps and data.) – ienissei Dec 24 '13 at 23:52
  • @ienissei: Like all things I attempt, I thought it would start small (perhaps just the globe and countries), and then expand over time (higher resolution, provinces, municipalities, etc). All of this, done by just trawling the Internet and finding lat/long shape files and converting them into a manageable format. Good idea about the distribution though. – Werner Dec 25 '13 at 00:18
  • @wasteofspace: Thanks for weighing in. Where would one be able to host such a bounty of information (for free)? GitHub? Amazon? – Werner Dec 25 '13 at 00:26
  • This seems to me to consist of two things: the TeX-specific stuff + the datasets. The assumption seems to be that the latter need to be customised for TeX but surely that is not a very efficient approach? Wouldn't it make more sense to identify (or create) datasets for general use and then figure out a way to interact with them through TeX (or through TeX plus some scripts or whatever)? I don't think the TeX community is likely to do a good job maintaining huge datasets over time, especially ones which are not inherently tied up with the system. – cfr Dec 25 '13 at 02:11
  • I'd agree that hosting it externally is better, especially as this data changes over time. GitHub seems like a natural choice depending on your dataset format. You can draw from external sources, too. – Sean Allred Dec 25 '13 at 05:43
  • I agree with @cfr that the datasets should remain in their original form. It should not be too hard to write TeX code to handle a useful subset of the KML syntax, and have users download KML files instead of a new file format tailored to TeX. One option would be to provide a script (not necessarily written in TeX code) which the user should run once on the KML file to produce a TeX-friendly format. Another advantage, besides not having to worry about storing this data yourself, is that users could convert/use their own (possibly private) KML files. – Bruno Le Floch Dec 25 '13 at 18:32
  • @BrunoLeFloch: This is a great idea! I guess a script to process a KML (ideal) will fit on [so]. – Werner Dec 26 '13 at 02:40
  • Or on gis.stackexchange, perhaps – Bruno Le Floch Dec 26 '13 at 18:07
  • @cfr: Care to write up an answer? Perhaps collect some of Bruno's thoughts on the matter as well. – Werner Dec 31 '13 at 07:05
  • Is there a reason that OpenStreetMap was not mentioned? I believe this is THE ultimate source for this kind of data. One merely has to teach TeX how to understand this data. – Dror Dec 31 '13 at 15:15
  • Another option is to use R with knitr to process the KML data and plot the maps. Then LaTeX is just displaying graphics files. I'm in the midst of setting up for teaching 5 courses this spring so I don't have time to flesh this out, but check this site: http://www.r-bloggers.com/mapping-primary-care-trust-pct-data-part-1/ You can search the R-bloggers site for kml and there are more than 10 blog posts about using kml files with R. – R. Schumacher Dec 31 '13 at 16:21

1 Answer


I'm writing this answer as requested in the comments to the question, incorporating some of Bruno Le Floch's ideas. Since I know nothing about KML syntax, suggestions in this regard are very welcome!

Maintaining TeX-specific versions of the datasets is likely to produce a less than ideal solution. First, it is inefficient since it will need to duplicate maintenance work already done elsewhere. Second, I don't think the TeX community is likely to do a good job maintaining huge datasets over time, especially ones which are not inherently tied up with the system.

So it would be better to think of the problem as requiring two things:

  1. identification or creation of suitable datasets designed for general use;
  2. design and maintenance of a way for TeX to interact with these datasets, perhaps using scripts.

So the idea, as elaborated by Bruno Le Floch, would be for users to download multi-purpose KML files as required. A script could be provided to download these files and to extract a useful subset of the information in them into a format which TeX could then use directly in typesetting. This need not itself be written in TeX code.
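
Just to make this concrete: the extracted, TeX-friendly format could be as simple as one file per jurisdiction containing ready-made path data. The file name, the \gisboundary macro and the coordinates below are all invented for illustration (and, as I said, I know nothing about KML):

    % Hypothetical output of the extraction script, say gis-data-ZAF.tex.
    % Each \gisboundary records one polygon as a TikZ path specification using
    % (longitude,latitude) pairs taken from the vertex lists in the KML file.
    % The three vertices below are dummy values, not real boundary data.
    \gisboundary{ZAF}{(16.5,-28.6) -- (32.9,-26.9) -- (31.3,-22.4) -- cycle}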

One option would be to use something like perl, which should make the script usable on the platforms supported by TeX Live, for example, since TL itself depends on scripts written in perl. (In the case of Windows, TL provides perl itself; OS X, GNU/Linux etc. already have perl available.) perl is used for the getnonfreefonts script mentioned by egreg.

A package would then be provided to interact with the extracted subset of information, offering user-friendly macros to utilise this information in documents. Since the extracted subset would be smaller than the original KML dataset, this would be faster to parse, speeding up typesetting. Since the extraction would be scripted, it would be easy to update by re-downloading and re-extracting information from the original source of the datasets. In cases where currency is really critical, the updating could be automatically managed by having TeX run the download and extraction script during typesetting. But I assume this would not be very useful in the majority of cases.
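
As a minimal sketch of what the user-facing side might look like, assuming the hypothetical extracted format above: \gisboundary stores each path under its code, and \drawgismap replays it in a tikzpicture. All of these names are invented, and longitude/latitude are used directly as x/y, so a real package would add a proper map projection and sensible scaling.

    % Minimal sketch of the consuming macros (all names invented).
    \usepackage{tikz}

    % \gisboundary{<code>}{<path>} stores the boundary path under its code.
    \newcommand*{\gisboundary}[2]{%
      \expandafter\def\csname gispath@#1\endcsname{#2}%
    }

    % \drawgismap{<code>} draws the stored path; lon/lat serve as x/y here.
    \newcommand*{\drawgismap}[1]{%
      \begin{tikzpicture}[x=2mm,y=2mm]
        \expandafter\draw\csname gispath@#1\endcsname;
      \end{tikzpicture}%
    }

After \input{gis-data-ZAF.tex} (the hypothetical extracted file), \drawgismap{ZAF} would draw the little dummy triangle standing in for the real boundary.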

This would solve several problems:

  1. The problem of storing huge datasets would evaporate since the KML files would be stored wherever they are stored anyway.
  2. It would avoid the issue of duplicating work within the TeX community which is better done by (probably larger and better equipped) communities elsewhere.
  3. Bruno Le Floch also pointed out that it would allow users to convert and use their own private KML files.
  4. Indeed, it would allow the use of KML files from any source, and would easily generalise should files with similar syntax be used in other contexts. (I don't know anything about KML so this is a purely theoretical/hypothetical point!)
cfr