Here's a little Perl script that might get you started. You can use it as
perl removelatexcode.pl myfile.tex myfile1.tex
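and can call it with as many files as you like. As written, it only reads the files named on the command line; it does not read from standard input, so piping into it will not work without modification.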
It does the following:
- when run with the -o option (described below), copies your input file myfile.tex to myfile.tex.bak just in case something goes wrong
- loops through each line in the file, and only starts working once it hits \begin{document} (everything in the preamble is discarded)
- once it is in the main document, it matches and removes patterns such as \begin{<environmentname>}, \end{<environmentname>} and \<name of command>{<argument>}; you can add to it as you see fit.
As the code stands, it won't overwrite the original file; it just prints the stripped text to the terminal. Once you're happy with it, and have tested it to your liking, feel free to go ahead and use the script with the -o option, as in
perl removelatexcode.pl -o myfile.tex
which will overwrite myfile.tex (after backing it up to myfile.tex.bak).
Always be careful when using scripts like this: there is no malicious intent here, but you should test it thoroughly before using it on live files.
If there are some commands for which you wish to keep the argument, for example, \underline{keep this argument} then simply populate
my %keeparguments=("textit"=>1,
"underline"=>1,
);
with the appropriate commands.
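To make the effect of that hash concrete, here is a minimal, self-contained sketch of the same keep-or-drop logic applied to a single string. The sub name strip_commands is made up for this illustration and is not part of the script. The literal braces are escaped here on the assumption that your Perl may not accept an unescaped { in a pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# commands whose argument should survive (same idea as %keeparguments)
my %keeparguments = ("textit" => 1, "underline" => 1);

# strip_commands is a hypothetical helper, not part of removelatexcode.pl
sub strip_commands {
    my ($line) = @_;
    # keep going while there is a \command{argument} left in the line
    while ($line =~ m/\\(.*?)\{.*?\}/) {
        if ($keeparguments{$1}) {
            # keep the argument, drop the command name and the braces
            $line =~ s/\\.*?\{(.*?)\}/$1/;
        }
        else {
            # drop the command together with its argument
            $line =~ s/\\.*?\{.*?\}//;
        }
    }
    return $line;
}

# prints "keep this and  done" (note the two spaces left behind)
print strip_commands('\underline{keep this} and \mycommand{drop this} done'), "\n";
```

Adding another command name to the hash, for example "emph"=>1, should be enough to keep its argument as well.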
removelatexcode.pl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;
use Getopt::Std;

# get the options
my %options=();
getopts("o", \%options);

my $inpreamble=1;   # switch for in the preamble or not
my $filename;
my @lines=();       # @lines: stores the new lines without commands

# commands for which we want to keep the arguments- populate
# as necessary
my %keeparguments=("textit"=>1,
                   "underline"=>1,
                  );

# note: literal braces in the patterns below are escaped (\{ and \})
# because some Perl versions reject an unescaped { in a regexp

while (@ARGV)
{
    # get filename from arguments
    $filename = shift @ARGV;

    # open the file for reading
    open(my $inputfile,"<",$filename) or die "Can't open $filename: $!";

    # reset the preamble switch
    $inpreamble=1;

    # reset the lines array
    @lines=();

    # loop through the lines in the INPUT file
    while(<$inputfile>)
    {
        # check that the document has begun
        if($_ =~ m/\\begin\{document\}/)
        {
            $inpreamble=0;
        }

        # ignore the preamble, and make string substitutions in
        # the main document
        if(!$inpreamble)
        {
            # remove \begin{<stuff>}[<optional arguments>]{<mandatory arguments>}
            s/\\begin\{.*?\}(\[.*?\])?(\{.*?\})?//g;

            # remove \end{<stuff>}
            s/\\end\{.*?\}//g;

            # remove \<commandname>{with argument}
            while ($_ =~ m/\\(.*?)\{.*?\}/)
            {
                if($keeparguments{$1})
                {
                    # keep the argument, remove the command
                    s/\\.*?\{(.*?)\}/$1/;
                }
                else
                {
                    # remove the command and its argument
                    s/\\.*?\{.*?\}//;
                }
            }

            # print the current line (if we're not overwriting the current file)
            print $_ if(!$options{o});
            push(@lines,$_);
        }
    }

    # close the file
    close($inputfile);

    # if we want to overwrite the current file
    if ($options{o})
    {
        # make a backup of each file
        my $backupfile= "$filename.bak";
        copy($filename,$backupfile) or die "Can't back up $filename: $!";

        # reopen the input file to overwrite it
        open(my $outputfile,">",$filename) or die "Can't open $filename: $!";
        print $outputfile @lines;
        close($outputfile);

        # output to terminal
        print "Backed up original file to $backupfile\n";
        print "Overwritten original file without commands\n";
    }
}
exit;
Here's a little test case:
myfile.tex
\documentclass{article}
% in the preamble
% in the preamble
% in the preamble
\begin{document}
\begin{myenvironment}
text text text text text text text text text text
text text text text text text text text text text
text text text text text text text text text text
text text text text text text text text text text
\end{myenvironment}
\mycommand{argument} more text after it \anothercommand{another argument}
\textit{keep this argument} more text after it \anothercommand{another argument} yet more text
\anothercommand{another argument} yet more text \textit{keep this argument} more text after it
\begin{anotherenvironment}[optional arguments] could have text here
other other other other other other other other other other
other other other other other other other other other other
other other other other other other other other other other
other other other other other other other other other other
\end{anotherenvironment}
\begin{anotherenvironment}[optional arguments]{mandatory args} could have text here
another another another another another another
another another another another another another
another another another another another another
another another another another another another
\end{anotherenvironment} can have text here
\end{document}
and the output of
perl removelatexcode.pl myfile.tex
is:
text text text text text text text text text text
text text text text text text text text text text
text text text text text text text text text text
text text text text text text text text text text
more text after it
keep this argument more text after it yet more text
yet more text keep this argument more text after it
could have text here
other other other other other other other other other other
other other other other other other other other other other
other other other other other other other other other other
other other other other other other other other other other
could have text here
another another another another another another
another another another another another another
another another another another another another
another another another another another another
can have text here
A few words about regexp
You'll notice the script uses lines such as
s/\\begin{.*?}(\[.*?\])?({.*?})?//g;
This matches
\begin{<environmentname>}
\begin{<environmentname>}[<optional arguments>]
\begin{<environmentname>}[<optional arguments>]{<mandatory arguments>}
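but it does so in a non-greedy way. The .*? makes it non-greedy, and the ? after each grouping () makes that group optional. If these matches were greedy (which they would be without the ?), a single .* would run from the first { to the last } on the line and swallow any text in between, so you would get a lot of potentially unwanted results. Note also that some Perl versions warn about, or reject, an unescaped literal { in a pattern, so it is safest to write literal braces as \{ and \}.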
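As a rough illustration of the difference, the sketch below runs a non-greedy and a greedy variant of that substitution on the same line. The sample line and the variable names are made up for this example, and the braces are escaped so that the patterns should compile regardless of Perl version.

```perl
use strict;
use warnings;

# a made-up line with an environment start followed by braced text
my $line = '\begin{center}[opt] some {braced} text';

# non-greedy: .*? stops at the first closing brace
(my $nongreedy = $line) =~ s/\\begin\{.*?\}(\[.*?\])?(\{.*?\})?//g;

# greedy: .* runs on to the last closing brace on the line
(my $greedy = $line) =~ s/\\begin\{.*\}(\[.*\])?(\{.*\})?//g;

print "non-greedy: '$nongreedy'\n";   # ' some {braced} text'
print "greedy:     '$greedy'\n";      # ' text'
```

The greedy version removes "some {braced}" along with the \begin, which is exactly the kind of unwanted result mentioned above.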
detex+sed; or pdftotext? – jon Mar 14 '13 at 03:00
pdftotext. You won't run into OCR problems when you have a "regular" non-scanned PDF like pdftex and friends produce. – Mike Renfro Mar 14 '13 at 12:06
pdftotext and/or detex+sed. I am interested in learning more about those options and will do some reading about them in the next few days. – lawlist Mar 15 '13 at 05:28
detex -n full_path_to_tex_file.tex > output_text_file.txt, but that only exported my \newgeometry settings in the preamble and nothing else. – lawlist Mar 18 '13 at 04:28