24

I often work with big tables that I want to treat kind of like a database. Here's an example table.

theTable = {
 {"id","color","size","flavor"},
 {1,"blue",5,"cherry"},
 {2,"green",5,"piquant"},
 {3,"blue",20,"peppermint"}
}

In a database I would ask

SELECT * FROM `theTable` WHERE `color` = 'blue' AND `size` > 10

and get effectively

{{3, "blue", 20, "peppermint"}}

in response. In Mathematica, I need to determine the "column number" for color and size, then use Cases with And to do the same thing.

Cases[theTable[[2 ;;]], a_ /; And[a[[2]] == "blue", a[[3]] > 10]]
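
(For reference, the column numbers can be read off the header row rather than hard-coded; colorCol and sizeCol below are just illustrative names.)

colorCol = Position[First[theTable], "color"][[1, 1]]  (* 2 *)
sizeCol = Position[First[theTable], "size"][[1, 1]]    (* 3 *)
Cases[theTable[[2 ;;]], a_ /; And[a[[colorCol]] == "blue", a[[sizeCol]] > 10]]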

This is operationally much clumsier than the database way. Unfortunately, setting up a database for every such table is too much extra work, particularly since I want the data to end up in Mathematica anyway.

How can this approach be improved? Specifically, how can I more easily use the column names directly, instead of their part numbers? And how can I avoid the ugly a_/;f[a] pattern?

rcollyer
ArgentoSapiens
  • Have you seen this answer, and those linked to it, particularly this one? Those can be a starting point. – Leonid Shifrin Sep 26 '12 at 22:52
  • I had not seen those. That is a neat solution and I'll give it a try. Thank you. – ArgentoSapiens Sep 26 '12 at 23:01
  • If you went with DatabaseLink you don't have to create one database for each table, but one table for each table. Not soo much work – Rojo Sep 26 '12 at 23:07
  • @Rojo I did not test it for hsqldb, which is an in-memory database and therefore should be fast, but since all database access in DatabaseLink goes through JDBC and J/Link, I would think that for large tables and when the result of the query is also a large collection of rows, the performance hit may be quite noticeable. – Leonid Shifrin Sep 26 '12 at 23:10
  • You know that Extract[theTable, Position[theTable, {_, "blue", x_ /; x > 10, _}]] works, right? – J. M.'s missing motivation Sep 26 '12 at 23:14
  • Just to be clear, in your real application you don't actually know what column number colour and size are? If you do know the actual structure of your table then why not Cases[theTable, {_, "blue", x_ /; x > 10, _}]? – Mike Honeychurch Sep 26 '12 at 23:29
  • @J.M. and Mike, the real table sometimes has hundreds of columns. Today's has 36, which is still too many to build the pattern like that, don't you agree? – ArgentoSapiens Sep 26 '12 at 23:37
  • @ArgentoSapiens that was important information for us ;) – Mike Honeychurch Sep 26 '12 at 23:40
  • Consider, then, ReplacePart[ConstantArray[_, 4], Thread[Flatten[Position[{"id", "color", "size", "flavor"}, #] & /@ {"color", "size"}] -> {"blue", x_ /; x > 10}]]. – J. M.'s missing motivation Sep 26 '12 at 23:40
  • @Argento: "that was important information for us" - which you should have included to begin with. – J. M.'s missing motivation Sep 26 '12 at 23:41
  • @J.M. I think that one of the biggest conveniences of SQL tables is that you get a symbolic (string) handle for each column, which is its name. It is invariant under permutations of columns and addition / removal of columns, so SQL queries are maintainable. The solutions I linked to solve this problem by automatically generating a pure function used in the Select query by using the table itself, so require a minimal effort from the user. I think, you are suggesting something similar but based on Cases-Position. – Leonid Shifrin Sep 26 '12 at 23:48
  • @Leonid, yes something like that. :) I haven't seen your SO answer before; thanks for that! – J. M.'s missing motivation Sep 26 '12 at 23:53
  • @J.M. No problem :-). I was thinking of extending those answers into a tiny framework emulating basic SQL stuff, but did not yet find the time to sit down and do it. – Leonid Shifrin Sep 26 '12 at 23:59
  • @LeonidShifrin that would be a useful package. – Mike Honeychurch Sep 27 '12 at 00:08
  • @MikeHoneychurch Yes, I tend to think so too. Since the simplest set of operations is probably not hard to implement, I will think about it. One thing I was concerned about was efficient inserts. Also, if I want to support indexes, things will get more complex. This is one of these problems where it is very easy to keep asking a little more and more and then suddenly open a can of worms :-). I think it is the anticipation of this that prevented me from actually doing this. – Leonid Shifrin Sep 27 '12 at 00:12
  • @LeonidShifrin yes it could easily balloon into something large. Reflecting on this a bit more, for larger tables for any user that is familiar with basic SQL queries it is probably easier to use a database -- this is what they are for -- and use DatabaseLink. – Mike Honeychurch Sep 27 '12 at 00:20
  • @MikeHoneychurch Yes, I agree. So, I would rather have an in-memory production-quality SQL database integrated into Mathematica via LibraryLink with a well-designed Mathematica-side API (perhaps the same as DatabaseLink), than rolling out something in the top-level Mathematica which would be only of a very limited value. This was actually another reason that I did not go there. – Leonid Shifrin Sep 27 '12 at 00:27
  • @LeonidShifrin and Mike, is there a way to have a local database without installing a local MySQL server? – ArgentoSapiens Sep 27 '12 at 12:55
  • @ArgentoSapiens Well, hsqldb seems to be such a database. It comes with Mathematica I think. – Leonid Shifrin Sep 27 '12 at 13:00
  • @ArgentoSapiens as Leonid mentioned Mma comes bundled with a database. The documentation runs through some usage examples (I have never used hsqldb myself other than in these examples). Coincidentally I was looking at this a couple of days ago because it appears to be included in Player Pro as well -- which could be useful for building apps. – Mike Honeychurch Sep 27 '12 at 23:00

5 Answers

13

A way of getting around the a_/;test[a] syntax is to write the tests with the column names as strings, and use replacement rules to substitute in the values. For this to work you need to build the rules from your table. Here is a simple implementation:

 SetAttributes[queryCriteria, HoldAll]
 queryCriteria[theTable_, query_] := Function[{entry}, 
 Unevaluated[query] /. (Rule @@@ Transpose[{theTable[[1]], entry}]), HoldAll]

 Select[theTable, queryCriteria[theTable, "color" == "blue" && "size" > 10]]

Personally I would prefer not having to give theTable as an argument to the query function constructor, since conceptually you shouldn't need a table to define a query; however, it's needed during the construction because the field names are listed in the first row. A nice way to work around this is to consider a query an independent entity which doesn't use the table until it's passed to Select. This can be defined by setting an UpValue for Select; to resemble your included example, I use where as the name for the query:

 SetAttributes[where, HoldAll]
 Select[table_, where[query_]] ^:= Select[table, queryCriteria[table, query]]

So that the query can be written:

 Select[theTable, where["color" == "blue" && "size" > 10]]
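
With the example table from the question this should return the same single row as the SQL query; the header row is also tested but simply fails the criteria, since "color" == "blue" evaluates to False:

(* {{3, "blue", 20, "peppermint"}} *)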

These are all just ways of doing the same thing with different syntax, however. I would expect performance issues to become more important with big databases.

jVincent
13

Don't reinvent the wheel: If you need a database you should be aware of the SQLite access readily built into Mathematica, though unfortunately undocumented:

db = Database`OpenDatabase[FileNameJoin[{$TemporaryDirectory, "mma-temp-db.sqlite"}]];

Database`QueryDatabase[db, 
    "CREATE TABLE stuff(id INTEGER PRIMARY KEY,color TEXT,size REAL,flavor TEXT)"
];

Database`QueryDatabase[db, "BEGIN"];

Scan[
  Database`QueryDatabase[db, 
    ToString@StringForm[
      "INSERT into stuff(color,size,flavor) VALUES ('`1`',`2`,'`3`')",
      Sequence @@ #
  ]] &,
  theTable[[2 ;; -1, 2 ;; 4]]
];

Database`QueryDatabase[db, "END"];

Database`QueryDatabase[db,"SELECT * FROM stuff WHERE color = 'blue' AND size > 10"]

Database`CloseDatabase[db]

and in case you are more interested in speed than persistence:

Database`OpenDatabase[":memory:"]

For details, just look for documentation about SQLite; there is plenty of good documentation around for it.

EDIT: as murta mentioned in his comment, it is also possible to make use of SQLite with the officially supported and documented DatabaseLink`. In version 10 a corresponding driver is included; for earlier versions a SQLite JDBC driver has to be installed manually. As far as I can tell, using the Database`* functions is a very lightweight approach, most probably making direct use of the SQLite libraries, while DatabaseLink` goes through Java/JLink/JDBC, which is rather heavyweight but of course also has its advantages. The following, also from murta, is the above example rewritten with DatabaseLink:

Needs["DatabaseLink`"]
conn=OpenSQLConnection[JDBC["SQLite",$TemporaryDirectory<>"testBase.sqlite"]];
SQLExecute[conn,"CREATE TABLE stuff(id INTEGER PRIMARY KEY,color TEXT,size REAL,flavor TEXT)"];
SQLInsert[conn,"stuff",{"color","size","flavor"},theTable[[2;;-1,2;;4]]];
SQLExecute[conn,"SELECT * FROM stuff WHERE color = 'blue' AND size > 10"]
CloseSQLConnection[conn]

For an in-memory version use: conn=OpenSQLConnection[JDBC["SQLite(Memory)","jdbc:sqlite::memory:"]];

Just for completeness: all versions of DatabaseLink that I can remember also include drivers for HSQL, which provides similar functionality to SQLite; since version 10 there are also drivers for H2 and Derby included, which claim similar functionality as well.
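
For example, an in-memory HSQL database can be used in much the same way. This is only a sketch: the driver name and the connection name below are assumptions and should be checked against the output of JDBCDrivers[] for your version:

Needs["DatabaseLink`"]
conn = OpenSQLConnection[JDBC["HSQL(Memory)", "tempdb"]];
SQLExecute[conn, "CREATE TABLE stuff(id INTEGER PRIMARY KEY, color VARCHAR(32), size REAL, flavor VARCHAR(32))"];
SQLInsert[conn, "stuff", {"id", "color", "size", "flavor"}, Rest[theTable]];
SQLExecute[conn, "SELECT * FROM stuff WHERE color = 'blue' AND size > 10"]
CloseSQLConnection[conn]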

EDIT: since version 11.1 the Database` functions have been removed, so for any version newer than 11.0 one has to use the DatabaseLink` approach. As those versions come with the SQLite driver, you can still access SQLite databases out of the box.

Albert Retey
  • Executing the first command: db = Database`OpenDatabase[FileNameJoin[{$TemporaryDirectory, "mma-temp-db.sqlite"}]]; I get Database`Database::liberror: A required library could not be located or opened. in MMA V10.0.2 – Murta Feb 07 '15 at 00:03
  • You can do the same with documented commands, with: Needs["DatabaseLink`"] conn=OpenSQLConnection[JDBC["SQLite",$TemporaryDirectory<>"testBase.sqlite"]]; – Murta Feb 07 '15 at 00:08
  • interesting, which OS are you on? I've used that in several MMA versions (example was written and tested with V10.0.2) on Windows and never had problems, but I remember to have seen complaints of others... – Albert Retey Feb 07 '15 at 00:08
  • @murta: Yes, since version 10 I think there is a SQLite driver for DatabaseLink predefined, I'll make an addition about it. It probably is also worth mentioning that the undocumented access is very lightweight compared to what DatabaseLink does, but often using the documented stuff is still a better idea... – Albert Retey Feb 07 '15 at 00:10
  • yes, I just confirmed. Worked in Windows but not in Mac OSX 10.10.2. Do you know if there is some memory version for JDBC? I would be great. – Murta Feb 07 '15 at 00:12
  • @murta: I think that you just have to give :memory: or file::memory: where you usually would give the path to the sqlite file in the connect string to open an in memory database, but I have not tried if that works with DatabaseLink. I'd really be interested to learn what goes wrong on OSX with Database`OpenDatabase. AFAIK sqlite as such is available and working on macs, so it probably is just a matter of adjusting/correcting some library pathes... – Albert Retey Feb 07 '15 at 00:21
  • just tried conn = OpenSQLConnection[JDBC["SQLite", ":memory:"]] and that seems to work... – Albert Retey Feb 07 '15 at 00:24
  • Using "::memory::" just create a SQLite file called "//memory//" in your current DirectoryStack[] not memory. – Murta Feb 07 '15 at 00:32
7

I had forgotten about this question, the answer linked by @Leonid, @jVincent's, etc., and last week I found myself with the same "need".

I'll just post what I used since it's no extra work, in case it still helps someone.

Speed wasn't a concern, so I have no clue how much time this wastes.

LabeledMatrix[cs_, mat_][cols : {__String}, funQ_] := 
    LabeledMatrix[cs, mat][cols, funQ, cols];

Normal[LabeledMatrix[_, mat_, ___]] ^:= mat

(lm : LabeledMatrix[cs_, mat_?MatrixQ])[cols : {__String}, funQ_, showCols : {___String}] :=
    Extract[mat[[All, label2Position[lm, showCols]]], 
      Position[LabeledMatrix[cs, mat][cols], {i___} /; funQ[i], {1}]];

LabeledMatrix[cs_, mat_?MatrixQ][cols : {__String}, All] := 
    LabeledMatrix[cs, mat][cols];
(lm : LabeledMatrix[cs_, mat_?MatrixQ])[cols : {__String}] := 
    mat[[All, label2Position[lm, cols]]];

SetAttributes[label2Position, Listable];
label2Position[LabeledMatrix[cols_List, ___], lab_] := 
    First@Flatten@Position[cols, lab, {1}, 1];

There's basically no error checking, formatting rules, etc.

Usage

LabeledMatrix is a wrapper. It takes, as a first argument, the names of the columns, and as a second, the data matrix.

lm = LabeledMatrix[
   {"ID", "Person", "Age"},
   {{4, "Peter", 23}, {5, "Mary", 33}, {55, "John", 23}}];

Say you want the "Person" and "Age" columns:

lm[{"Person", "Age"}]

(* {{"Peter", 23}, {"Mary", 33}, {"John", 23}} *)

The first argument (unless you use the three-argument form) is a list of the columns you want as output.

If you give a second argument, then that second argument is a predicate function to filter rows. The arguments taken by that function are the columns supplied in the first argument. For example:

lm[{"Age", "ID"}, #2 > #1 &]

(* {{23, 55}} *)

If you supply a third argument, it's the list of columns returned. The first argument still works as the input to the predicate. So, say you want the IDs of the people aged under 30:

lm[{"Age"}, # < 30 &, {"ID"}]

(* {{4}, {55}} *)

A second argument of All is the same as omitting it. Normal gives the data matrix; First, or some other convenience function you may want to create, gives the names of the columns.
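
Applied to the table from the question, and assuming the definitions above have been evaluated (lmTable is just a throwaway name), the three-argument form reproduces the SQL result:

lmTable = LabeledMatrix[First[theTable], Rest[theTable]];
lmTable[{"color", "size"}, #1 == "blue" && #2 > 10 &, {"id", "color", "size", "flavor"}]

(* {{3, "blue", 20, "peppermint"}} *)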

Rojo
7
  1. Keep your data in external .csv files. Data is stored/queried in/from "tables" in a db anyway.

  2. query=ReadList["!grep ...", String]

  3. Process query, use ToExpression where appropriate, etc. (see the sketch below).
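
A minimal sketch of steps 2 and 3, assuming a Unix-like system with grep on the path and the data rows stored in a hypothetical file theTable.csv (one line per row, e.g. 1,blue,5,cherry). grep pre-filters on the color; the remaining condition is applied after converting the matches:

raw = ReadList["!grep ',blue,' theTable.csv", String];
rows = ImportString[StringJoin[Riffle[raw, "\n"]], "CSV"]; (* numbers become numbers, strings stay strings *)
Select[rows, #[[3]] > 10 &]

(* {{3, "blue", 20, "peppermint"}} *)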

This has the following advantages:

  • you don't actually keep your data in the kernel, you keep it in files, outside of the M system. That's where data storage SHOULD be: outside of the kernel. You shouldn't use the kernel for data storage, only for data processing.

  • now the kernel will contain and process only your extraction / query result / result set. MUCH more memory-efficient. You shouldn't have in the kernel what you're not interested in anyway.

  • grep and egrep are extremely fast. You're actually outsourcing the query to operating system speed.

  • you don't need the overhead of a real d/b system (CPU, memory, installation, drive space, etc.). You only consume the drive space needed for your .csv files.

What I've described above is actually how I store large amounts of data that I want to be available for queries. I don't consider this a workaround, it's a solution. Instead of installing a d/b application, I ALWAYS ask if I can simply do my tables (which I'd have to set up in a d/b system anyway) as well-designed .csv files (conforming to certain d/b standards, such as normalization, etc.), and then use the speed of grep, o/s methods, and ReadList["!grep ...",...], followed by string-based processing and possible type conversions.

It should be extremely hard to beat that speed, you don't need external d/b applications, you don't have external links (as you may be aware, M's DatabaseLink package uses JLink internally), and you have additional flexibility with this approach. You could, for example, save the .csv files from spreadsheet programs or other generators, and you can zip them when you want to archive them (text files can be zipped down to 7%), and you could even include zip/unzip steps in your M program to "set up" your d/b (see the one-liner below).

A collection of properly formatted .csv files is actually a relational database! A simple d/b is nothing but a collection of well-organized tables, so you can do it yourself with text files (I recommend .csv).
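
For instance, a zipped table could be queried without unpacking it first (hypothetical file name, assuming zcat is available on the system):

ReadList["!zcat theTable.csv.gz | grep ',blue,'", String]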

Andreas Lauschke
  • I'm going to be dealing with large files and databases soon. It's been ten years since you wrote this. Have you discovered new methods to do this? Or is this still the optimal method? Are you dealing with other files besides CSVs now? – jWey Mar 25 '23 at 07:10
  • I still recommend to use such Linux command line tools, and to submit from ReadList (or sometimes Read) to the command line, rather than reading in everything into the M kernel. I still use mostly .csv, although the important thing here is to use text files, that can also be .xml or .json for non-rectangular formats, or others. – Andreas Lauschke Mar 26 '23 at 19:09
  • That's what I've gathered as well. Any interesting command line tools? perhaps there's specific mathematica integration? I have text, pdfs, graphs, images, videos, xml, json, and databases to play with. I'm just thinking out loud here: dealing with a lot of data is even more important today; and now we have Apple Silicon and unified computer architectures that can access media engines and neural engines and other specialized engines. I'm sure the people working behind the scenes of wolfram language have a different view of what is possible with the code and language. Lots of potential. – jWey Apr 03 '23 at 03:40
  • I have found grep, egrep, curl, sed the most useful Linux command line tools for "pre-extracting" data that is then read into the M kernel. ReadList["!command",...], don't forget the !. You can even submit entire perl programs to the command with the -e option, such as ReadList["!perl -e ",...], although I acknowledge that nowadays people would probably prefer python over perl. ReadList["!python – Andreas Lauschke Apr 05 '23 at 18:50
  • You can also use Java's REPL, the jshell, from the command line, and with the - option specify that you don't want interactivity, this is like the -e option for perl: ReadList["!echo "System.out.println(\"Hello World\");" | jshell -"]. I also find the -n option for perl very useful when used with -e, because it gives you the diamond operator without specifying the outer while loop. So with perl -e -n you get to parse the raw text data, line by line, in a diamond operator, without while loop. I find that incredibly useful, but I acknowledge perl is considered outdated nowadays. – Andreas Lauschke Apr 05 '23 at 20:12
  • Cool! I've never used perl (found the command line options you mentioned and some more: -e -E -n -p -i) or jscript. I could probably do something with python. I'll stick with unix commands at first to see the extent of what is possible with that. Aside from python, perl, jscript, and unix commands, what else have you been using? Anything with databases? Also, was there something that was just not possible or practical to do through wolfram language? – jWey Apr 10 '23 at 06:46
  • I had only mentioned perl just in case you need something that can not be done with simple Linux commands. I find that grep, curl, sed cover everything I need. After all, this post is about big tables / databases, and a relational database is simply: a collection of tables, so you don't need much more than these. And earlier you had also asked about .pdf, and there are several Linux tools for that too to extract text or tabular structures from .pdfs. – Andreas Lauschke Apr 12 '23 at 19:24
  • Cool! Thanks for responding. – jWey Apr 13 '23 at 05:59
  • I found some very interesting pdf related tools but not quite what I was expecting. What are some of the linux pdf tools you've been using? I'm on a mac with apple silicon btw. – jWey Apr 17 '23 at 20:43
  • Don't know anything about Macs, and that will never change, I won't even touch them with pincers. On Linux, for command line-based extraction from .pdf you could start with pdftotext. Start testing it from the command line with something like pdftotext mydoc.pdf - | grep mysearchstring. Then read that in with ReadList, as above, ReadList["!pdftotext mydoc.pdf - | grep mysearchstring",String]. You might prefer a workflow that converts to html first, and then parse the html, there is pdftohtml: ReadList["!pdftohtml whatever",String] – Andreas Lauschke Apr 22 '23 at 19:10
  • lol Pincers may be sufficient, no? I have a linux system now as well. pdftotext is running on the mac. The -xml Option is also very useful. pdffonts and the rest of popper are great. Are there other useful tools that come to mind? – jWey Apr 23 '23 at 00:38
  • For tables treated as databases, usually grep, egrep, sed do it for me. I submit a lot of Linux command line tools from M and then read back results with ReadList["!whatever",String], mostly because now I can harness M's string manipulation functions that would be much more tedious to manage from the Linux command line (shell scripts, aliases, etc.). For video editing I use ffmpeg a lot, which is what most video editing software use under the hood anyway. Most video editing software is basically just GUI-based configurations for ffmpeg, for example stuff like ... – Andreas Lauschke Apr 23 '23 at 18:58
  • ... ffmpeg -ss xx:xx:xx -i inputfile.mp4 -ss yy:yy:yy -t zz:zz:zz -c copy outputfile.mp4, where zz:zz:zz is a difference! So to add/subtract 60-based numbers I prefer to use M to generate the proper zz:zz:zz. And if properly configured, you can use ffmpeg with a GPU, making this very powerful. But all that has nothing to do anymore with the original question, using tables as databases. – Andreas Lauschke Apr 23 '23 at 18:58
  • True. To answer the original question, you could now use Query[Select[#color === "blue" && #size > 10 &]]@theNewTable or something to that effect. – jWey Apr 24 '23 at 08:45
3

This is my way. I'm used to naming my columns as:

cId=1;cColor=2;cSize=3;cFlavor=4;

I prefix the names with c because it works well with autocompletion and prevents mixing up variables. And instead of Cases, I prefer to use Select, as below:

r=Select[Rest@theTable, #[[cColor]] == "blue" && #[[cSize]] > 10 &]
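
With the example table this gives the same row as the SQL query in the question:

(* {{3, "blue", 20, "peppermint"}} *)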

If you want to take just some columns you can, for example, do:

r[[All,{cId,cColor}]]

Sometimes I have a lot of columns, so it is tedious to do the first step by hand, so I created this function, which is in my tool bag:

sequenceVarList[list_] := Module[{varList},
    varList = StringReplace[list, {"_" -> "", " " -> ""}];
    Clear@@varList // Quiet;
    MapIndexed[(Evaluate[Symbol[#1]] = #2[[1]]) &, varList];
    TableForm[MapIndexed[{#2[[1]], #1} &, varList]]
]

So I can use it as:

sequenceVarList[{"cId","cColor","cSize","cFlavor"}]

and the first step is done easily.
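
After this call the index variables hold the expected column positions:

{cId, cColor, cSize, cFlavor}

(* {1, 2, 3, 4} *)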

I think that the big advantage of this solution is that you can use your column names in other calculations like GatherBy, SortBy, and so on.

Update

If you are dealing with many tables, you can use a variant of sequenceVarList, mrtSeqVarList, as:

Unprotect[Dot];
SetAttributes[Dot, HoldAll];
Protect[Dot];

mrtSeqVarList[varList_,symbol_Symbol] := Module[{},
    UpValues[symbol] = {}; (* reset any previous column definitions attached to this table symbol *)
    MapIndexed[(symbol/:symbol.#1=First@#2)&, varList];
]

So now you can do:

mrtSeqVarList[{cId,cColor,cSize,cFlavor},tab1]
r=Select[Rest@theTable, #[[tab1.cColor]] == "blue" && #[[tab1.cSize]] > 10 &]
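
Assuming the UpValues were set up as intended, so that tab1.cColor evaluates to 2 and tab1.cSize to 3, this again returns the single matching row:

(* {{3, "blue", 20, "peppermint"}} *)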
Murta