
What is the best way to export a notebook as readable plaintext - while preserving as much structure as possible? The motivation here is the need to bulk-ingest notebooks and chunk them (for use in RAG systems).

In such a solution, for example:

  • textual elements would be stripped of styling
  • section headings would be preserved
  • cell-tags could be listed underneath cell contents
  • non-text elements would simply be replaced with their notebook asset file paths or hyperlinks
  • ...

There were a few legacy answers about exporting to multi-markup/down that are now out of date, but is there any new "best way" to achieve this in v13.3/14?

user64494
user5601

1 Answer


I recently found myself in a similar situation where I wanted text-only but structured files generated from my notebooks for source control and revision tracking. I could not find an optimal solution, but ultimately I landed on exporting the notebooks as a package (*.m) file.

The following code essentially automates the process of opening each file in the front end, and then selecting "Save As..." in the File menu and choosing the "*.m" package option.

I wrote a helper function to do the following:

  • obtain the notebook object for the notebook of interest: first retrieve a list of objects for all currently open notebooks, then open the one of interest (the RunProcess part), then retrieve a list of all open notebooks again, and take the difference;
  • generate a new file name with a modified extension, replacing ".nb" with ".m"
  • save the notebook in "Package" format with the new name using front end tokens (i.e. menu items);
  • close the notebook.

My files were all in a single directory, so I just mapped this function over a list of notebook file names.

dir = SetDirectory["C:\\path\\to\\the\\files"];
files = FileNames["*.nb"];

ClearAll[openSaveNB];
openSaveNB[name_] := Module[
  {nbObject, before, newName},
  before = Notebooks[];
  RunProcess[{"cmd", "/c", name}];
  nbObject = First@Complement[Notebooks[], before];
  newName = FileNameJoin[{dir, StringReplace[name, ".nb" -> ".m"]}];
  FrontEndTokenExecute[nbObject, "Save", {newName, "Package"}];
  FrontEndTokenExecute[nbObject, "Close"]
]

openSaveNB /@ files

The process itself worked; I am still figuring out if the output is acceptable for my goals, but I thought I'd share the process anyway in case you find it useful.

A couple of caveats: 1) I work in Windows and the code makes the assumption that *.nb files are associated with Mathematica, so that invoking one such file in the shell automatically opens it; 2) in FrontEndTokenExecute[nbObject, "Save", {newName, "Package"}], newName should be a fully qualified file name, with the directory as well, otherwise the resulting file is saved in some default location that ignores the SetDirectory directive.
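
One possible way around the first caveat, as an untested sketch rather than a verified recipe: open each notebook with NotebookOpen, which returns the notebook object directly, instead of going through the shell. The helper names packagePath and saveAsPackage are hypothetical, and whether the "Save" token behaves the same on a notebook opened with Visible -> False is an assumption; drop that option if it does not.

```wolfram
(* Sketch only: same "Save as Package" token as above, but the notebook is
   opened with NotebookOpen, so no file association or shell call is needed. *)
ClearAll[packagePath, saveAsPackage];

(* Hypothetical helper: fully qualified path of the .m file next to the notebook. *)
packagePath[nbFile_String] :=
  With[{abs = ExpandFileName[nbFile]},
    FileNameJoin[{DirectoryName[abs], FileBaseName[abs] <> ".m"}]];

(* Hypothetical helper: open, save in "Package" format, close, return the new path. *)
saveAsPackage[nbFile_String] :=
  Module[{nb, target = packagePath[nbFile]},
    (* Assumption: an invisible notebook still accepts the "Save" token. *)
    nb = NotebookOpen[ExpandFileName[nbFile], Visible -> False];
    If[Head[nb] =!= NotebookObject, Return[$Failed, Module]];
    FrontEndTokenExecute[nb, "Save", {target, "Package"}];
    NotebookClose[nb];
    target];

(* Giving FileNames the directory yields full paths, so SetDirectory is not needed. *)
saveAsPackage /@ FileNames["*.nb", "C:\\path\\to\\the\\files"]
```

Because packagePath always builds a fully qualified name, this also sidesteps the second caveat about files landing in a default location.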

MarcoB
    I've been using the recipe suggested here, and it works except when I have copy/pasted graphics into a notebook manually, like here; in that case some of my notebooks become 300MB package files. I'm also looking for a solution that is text-only. – Yaroslav Bulatov Dec 03 '23 at 07:38