Is it possible to save a given web page, for example http://www.nytimes.com/, in either of these forms:

first: HTML only (with its CSS)

second: HTML plus further elements (pictures and so on)

just as a browser can do?

I have tried the following, but it only generates one large picture with plain text.

Export["F:\\nytimes.html", Import["http://www.nytimes.com"], "HTML"]
Mr.Wizard
withparadox2

2 Answers

Using Mathematica 9, the easiest way is URLSave:

URLSave["http://www.nytimes.com" , "C:\\temp\\test9.html"]

you get the output

"C:\temp\test9.html"

You can then open the saved file in your default browser directly from within Mathematica:

SystemOpen[%]

With earlier versions, try the following:

source = Import["http://www.nytimes.com", "Source"];
Export["C:\\temp\\test8.html", source, "Text"]

you get the output

"C:\temp\test8.html"

SystemOpen[%]
bobknight

I can understand it if Mathematica does not provide such functionality. It runs on top of an operating system, which already provides everything needed for this, such as socket I/O.

I don't see the point of doing this inside Mathematica.

What you can do is this:

a) Unix platform

Run["/path/to/wget", "http://www.nytimes.com"];

This just runs wget with its default settings. wget has many options that modify its result; for instance, to download a web page together with its requisites (CSS, links to other pages), I regularly use wget -E -H -k -K -p.

b) Windows platform

In case you don't want to download wget for Windows...

1) Write a PowerShell script (wget.ps1):

    (New-Object System.Net.WebClient).DownloadFile($args[0], 'C:\tmp\index.html')

2) @Bobthechemist found out how to run this on the Windows platform:

Run["powershell.exe -ExecutionPolicy ByPass -file c:\\tmp\\wget.ps1 http://www.nytimes.com"]

Once the webpage is downloaded, you can start to do all the extractions you want to do.

Edit 1:

Since I gather from your comments that you are thinking about a pure Java solution, you might consider this:

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;

public class WebPageSaver {
//    public static void main(String args[]) throws Exception {
//      saveWebpage("http://www.nytimes.com", "/path/to/your/home/index.html");
//    }

    // Downloads the raw bytes served at `site` and writes them to the file `target`.
    public static void saveWebpage(String site, String target) throws Exception {
        URL url = new URL(site);
        URLConnection conn = url.openConnection();
        conn.connect();

        // try-with-resources closes both streams, even if the copy fails
        try (InputStream is = conn.getInputStream();
             OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[4096];
            int numBytes;
            // read() returns -1 at the end of the stream
            while ((numBytes = is.read(buffer)) != -1) {
                out.write(buffer, 0, numBytes);
            }
        }
    }
}

You can then use one of the several approaches to link this into your Mathematica environment, for example:

1) Java-Reloader by Leonid <-- recommended!

2) Hands down approach
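
Note that saveWebpage only fetches the HTML itself, i.e. the first case in the question. For the second case (pictures and so on) you would also need the URLs of the page requisites, which is roughly what wget -p works out for you. Below is a minimal sketch of that step, not a complete solution: the class name RequisiteLinks and the method extract are made up for illustration, and the regular expression is a simplification that only handles quoted src/href attributes on img, script and link tags (a real HTML parser would be more robust). Each returned URL could then be passed to saveWebpage.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the original answer): collects the
// absolute URLs of images, scripts and stylesheets referenced by a page.
public class RequisiteLinks {

    // Simplifying assumption: attribute values are quoted and the tag is
    // one of img, script or link. Plain <a href="..."> links are ignored.
    private static final Pattern ATTR = Pattern.compile(
        "<(?:img|script|link)\\b[^>]*?\\b(?:src|href)\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Returns every matched src/href value, resolved against baseUrl.
    public static List<String> extract(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> links = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // The attribute value is not a valid URI; skip it.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"/css/site.css\">"
                    + "<img src=\"images/logo.png\">"
                    + "<a href=\"/other.html\">other page</a>";
        // Keep the trailing slash on the base URL so that relative paths
        // such as images/logo.png resolve correctly.
        for (String link : extract(html, "http://www.nytimes.com/")) {
            System.out.println(link);
        }
    }
}
```

The source string for extract could come from the file written by saveWebpage, or from Import[url, "Source"] on the Mathematica side.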

P.S.: I wrote the Windows part of this answer from memory, and I don't have anything here to verify that it works...

Stefan
  • Nice approach - I don't know powershell, but the script runs fine in powershell; however Run["powershell.exe -ExecutionPolicy ByPass -file c:\\tmp\\wget.ps1", \ "http://www.google.com"] just makes an empty file. – bobthechemist Jul 17 '13 at 00:07
  • yes. that's because you need to separate the call to the application from its arguments. since powershell.exe -... does not exist. if Run succeeds it will return 0 for EXIT_SUCCESS – Stefan Jul 17 '13 at 00:14
  • I think it would be super-sweet if I could Import[ webpage ] and it would render the whole web page in an output cell of Mathematica. – Eric Brown Jul 17 '13 at 00:33
  • @EricBrown agreed...but, i've no idea. on the other hand they do this with WolframAlpha calls, don't they? – Stefan Jul 17 '13 at 00:40
  • @EricBrown really? was thinking about your wishful thinking and I don't get the benefit...why do you think that? because it would show that the notebook interface is able to handle any mime automatically? – Stefan Jul 17 '13 at 01:00
  • @Stefan For example, I could Import[ coolMMAStackExchangeURL, "WebArchive"] into a section, and then I could do some of my own calculations in a new section below. (It would be just like a Wolfram Alpha query). I could save the whole file in one convenient notebook. Could I just embed a hyperlink to the web page? Yes. But one could not guarantee the web page would always be there. (I like webarchives on Macintosh.) Also, the content would be available to me offline. How often am I offline? Seldom, but wireless is often spotty. Anyway, many people even want emacs to have an html renderer... – Eric Brown Jul 17 '13 at 01:54
  • @Stefan Thanks. It actually works with Run["powershell.exe -ExecutionPolicy ByPass -file c:\\tmp\\wget.ps1 \ http://www.nytimes.com"] – bobthechemist Jul 17 '13 at 03:18