
I want to build a bot to automate web browsing, by which I mean things like:

  • filling in forms
  • pressing "submit" buttons
  • finding certain text inside pages
  • and so on...

How can I do this with Mathematica?

The Import function only lets you download a single web page; it doesn't support cookies and the other things needed to build a complete automated bot. Does Mathematica have a package that can do this?

Alexey Popkov
Francesco
  • If the form uses the GET method, submitting the form just means composing a URL (on which Import works). The question is very general in this form, and I'd be inclined to say Mathematica is not the right tool for this (you'll end up using JLink or .NETLink anyway). But if you can give a very specific example, we can think about how to implement it in Mathematica (or will be able to say with more confidence that it's not possible without external libraries) – Szabolcs Feb 15 '12 at 10:32
  • Yes, I know that for a GET method I can simply compose the URL in the right way. Here is an example: suppose I want to write a little script that logs in to an online service, enters some information in a form to generate a report, downloads the report, and then uses Mathematica to analyze it. I want to do all of this in "one click", without logging in and downloading manually! – Francesco Feb 15 '12 at 10:39
  • I don't believe that this is possible in pure Mathematica. Any working solution you might get will use external libraries, most likely through either JLink or .NETLink. You might be able to drive a browser object through .NETLink on Windows, but I am not familiar with the technology. – Szabolcs Feb 15 '12 at 10:43
  • If you need support for cookies, use wget. You can call it from Mma code by using Run. I'm about to go to bed, but do a search, because I posted something here in answer to another question a couple of weeks ago. ...here it is: http://mathematica.stackexchange.com/questions/1186/downloading-files-without-using-import/1211#1211 – Mike Honeychurch Feb 15 '12 at 11:26
  • For a bit of datamining on a website that Mathematica could not parse entirely (interactive java stuff) I used the iMacros Plugin for Firefox and remote-controlled Firefox to do a few things with a suitable script. It is a bit circuitous (and not within Mathematica, so no answer) but worked fine once set up properly. – Yves Klett Feb 15 '12 at 11:46
  • You can also try curl: http://stackoverflow.com/a/6977128/695132 – Szabolcs Feb 15 '12 at 12:08
  • Yes, curl is nice: I use RunCurl[x_String, dir_:"C:\\directorywherecurlis\\"] := Module[{id = ToString[Round[AbsoluteTime[]]], run, res}, run = Run2[StringJoin["%comspec% /c ", dir, "curl.exe ", x, " > ", dir, "curl", id, ".log 2> ", dir, "curl", id, ".err"]]; res = Import[StringJoin[dir, "curl", id, ".log"], "Text"]; DeleteFile[StringJoin[dir, "curl", id, ".log"]]; (If[FileExistsQ[#1], DeleteFile[#1]] & )[StringJoin[dir, "curl", id, ".err"]]; res]; – Rolf Mertig Feb 15 '12 at 12:37
  • with Run2[cmd_String] := Module[{shell}, Switch[$OperatingSystem, "Windows", Needs["NETLink`"]; shell = NETLink`CreateCOMObject["WScript.shell"]; shell[run[StringReplace[cmd, {"\n" -> "", "\r" -> ""}], 0, True]], "Unix", Run[cmd], "MacOSX", Run[cmd]]]; – Rolf Mertig Feb 15 '12 at 12:37
  • Hi Rolf, can you put your comment in an answer with separated code? It's difficult to understand what you wrote in this way. Thanks a lot :) – Francesco Feb 15 '12 at 12:44
  • @Francesco: But it is no answer. Just a note on http://stackoverflow.com/a/6977128/695132 , i.e., you get access to the log files. – Rolf Mertig Feb 15 '12 at 14:05
  • I've failed at this before, especially with respect to cookies. When trying to find text, I just use Import[x,"Source"], where x is the site (all manually downloaded with wget), and then find content using StringCases[], e.g. trlist = StringCases[pagetext, Shortest["<tr>" ~~ ___ ~~ "</tr>"]]; (which would find all text within rows in a page arranged in that way, for example) – canadian_scholar Feb 15 '12 at 14:24
  • Yes I've used curl for FTP-ing Mma content but it was only once or twice. wget is something I use regularly. – Mike Honeychurch Feb 15 '12 at 21:33
  • @Francesco Please see an answer here: http://mathematica.stackexchange.com/questions/2362/how-to-manipulate-web-pages-on-mathematica – Szabolcs Mar 03 '12 at 17:06
  • @szabolcs, how can this be a duplicate of http://mathematica.stackexchange.com/questions/2362/how-to-manipulate-web-pages-on-mathematica when this question is older? I suspect it is the other way around! ;). Either way we ought to either close this or have someone write up a suitable answer. If nobody does by tomorrow night, I'll go ahead and summarize all of what has been said in a community wiki answer. – nixeagle Mar 26 '12 at 04:24
  • @nixeagle It doesn't matter which one is older. The point is: the other question has a good answer which I think answers this as well as possible. A concrete example always helps. Others think this needs a general solution, not relying on a concrete example so it's still open. Also, I voted to close because that'd ensure that anyone finding this question will immediately be pointed to a good answer. – Szabolcs Mar 26 '12 at 05:57
  • @nixeagle Also, closing as duplicate is not penalizing the OP and is not a bad point for the OP. It's for keeping the site clean and useful for future visitors. This also applies to your ValueQ question---originally I voted to close because I thought it had been asked, not because it's a bad question (it is a good question). (Just to avoid any misunderstanding on why I vote to close.) – Szabolcs Mar 26 '12 at 05:59
  • I'm eager to see the improvements made to Import[] in version 9! – CHM Apr 02 '12 at 05:16
  • @Francesco On what platform are you? – Gustavo Delfino Apr 18 '12 at 03:59
  • Did you see any improvements in 9? I have it, but I'm still struggling with this exact issue! Any thoughts on whether I should go with JLink or curl? I'm thinking JLink has more support and may be better suited to websites. If you have any good resources or blogs, let me know! Thank you!! – Zlatko-Minev Jun 02 '13 at 15:05
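
Two of the ideas from the comments above (composing a GET URL, then searching the page text with StringCases) can be sketched roughly as follows. This is only an illustration: the endpoint and parameter names are hypothetical, URLBuild assumes version 10 or later, and a fixed string stands in for the downloaded page so the sketch runs without network access.

```mathematica
(* Hypothetical endpoint and query parameters, for illustration only *)
url = URLBuild["https://example.com/report", {"user" -> "alice", "year" -> "2012"}];

(* With a real URL, the raw HTML could be fetched like this: *)
(* pagetext = Import[url, "Source"]; *)

(* Stand-in page text so the sketch runs offline *)
pagetext = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>";

(* Extract each table row, following the StringCases pattern from the comment *)
trlist = StringCases[pagetext, Shortest["<tr>" ~~ ___ ~~ "</tr>"]]
```

This covers only forms submitted with GET and pages that need no login; cookies and POST forms still need wget, curl, or one of the link libraries, as discussed above.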

1 Answer


Here is a package which does what you want:

https://github.com/arnoudbuzing/webunit

Clone the repository from GitHub and place the WebUnit folder under $UserBaseDirectory/Applications.

To use it:

  1. Needs["WebUnit`"]

  2. InstallWebUnit[] (* launches chromedriver.exe *)

  3. StartWebSession[] (* launches Chrome web browser, assuming you have that installed *)

  4. OpenWebPage["http://mathematica.stackexchange.com"] (* opens the web page *)

  5. ClickElement[Id["nav-users"]] (* clicks the web element with id 'nav-users', i.e. the Users tab *)

And then TypeElement works similarly (assuming you have an input field with an id).

Edit: You can also use JavascriptExecute["alert('hi');"] to execute arbitrary JavaScript (in this example it brings up the alert dialog).
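
Put together, the steps above might look like the following session. It is a sketch, not tested against any particular WebUnit version: it assumes Chrome is installed and the package is under $UserBaseDirectory/Applications, and the element id 'q' used in the JavaScript is hypothetical. The form is filled and submitted through JavascriptExecute, so only functions named in this answer are used.

```mathematica
Needs["WebUnit`"]

InstallWebUnit[];    (* launches the chromedriver bundled with the package *)
StartWebSession[];   (* launches Chrome *)

OpenWebPage["http://mathematica.stackexchange.com"];
ClickElement[Id["nav-users"]];   (* click the Users tab by its element id *)

(* Hypothetical form field with id 'q': set its value, then submit its form. *)
(* Plain JavaScript is used here, so the button itself needs no id.          *)
JavascriptExecute["document.getElementById('q').value = 'some text';"];
JavascriptExecute["document.getElementById('q').form.submit();"];
```

Submitting through the field's form property is one way around a submit button that has no id of its own.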

Arnoud Buzing
  • Great application! But could you provide the WebUnit documentation as a PDF file? On my computer the provided documentation does not open properly. I am not sure whether the directory structure in the provided bundle is OK. BTW: is it possible to set the value of a RadioButton? – jano Dec 23 '14 at 22:49
  • @Arnoud Buzing Can you provide a minimal example of how to use Execute[] to run some JavaScript? I can find this function in the WebUnit source code. There is no documentation so I am asking you here. – PlatoManiac Jan 06 '15 at 04:30
  • This works for me: Execute["alert('hi');"] – Arnoud Buzing Jan 09 '15 at 20:17
  • The link above seems to be broken. Is there an update? – GregH Aug 20 '15 at 01:10
  • @GregH, I put a new version with a new link in the post above (not sure how this disappeared). Also, in this new version I renamed the function Execute to JavascriptExecute (more precise function name). – Arnoud Buzing Aug 20 '15 at 15:41
  • @ArnoudBuzing Are there any new functionalities added in the new version? – PlatoManiac Aug 20 '15 at 16:20
  • @ArnoudBuzing Tried to follow your steps above, but when I run InstallWebUnit[] I get the error message "Windows cannot find 'C:\Program'." I wonder if it is trying to access "C:\Program Files..." but the space is messing things up. Any suggestions? – GregH Aug 20 '15 at 17:23
  • @GregH Do you have the "WebUnit" directory directly under $UserBaseDirectory/Applications? – Arnoud Buzing Aug 20 '15 at 18:11
  • @GregH Also, please try re-downloading it (I think I uploaded an older version earlier today). If you run filehash webunit-master.zip you should get a15a9b82812999e2 – Arnoud Buzing Aug 20 '15 at 18:24
  • @ArnoudBuzing I unzipped the folder to C:\Program Files\Wolfram Research\Mathematica\10.0\AddOns\Applications as that seemed appropriate. I get the same error as before with the new .zip file. Also, this second version isn't as "clean" as the first; the first was only the folder WebUnit zipped; the latter has this folder, plus Tests and a few other files. (Hope that makes sense.) – GregH Aug 20 '15 at 18:48
  • @GregH OK, I've uploaded a new link which only contains WebUnit. Don't put anything under your installation directory, the appropriate location is $UserBaseDirectory/Applications. If that does not work, let me know. – Arnoud Buzing Aug 20 '15 at 19:07
  • @ArnoudBuzing Got it to work. Thanks! Now more questions: on the form I am trying to use, pressing Enter activates the "Submit" button, which unfortunately doesn't have an ID to use the ClickElement command with. Is there a way to do a generic TypeElement that has the same effect as pressing the Enter key? – GregH Aug 21 '15 at 12:50
  • The link in the post is broken again. Could you please update it? Also, have you considered adding it to http://packagedata.net/ ? – Szabolcs Nov 06 '15 at 09:35
  • A victim of my cloud object cleanup project ... I put a new package link above. This also adds support for the new Microsoft Edge browser. – Arnoud Buzing Nov 11 '15 at 22:51
  • Is this package Windows only? – shrx Nov 12 '15 at 08:43
  • It includes the chromedriver binaries for Mac and Linux as well, so this should work (if it doesn't work for you, let me know and I will try to fix it) – Arnoud Buzing Nov 12 '15 at 17:38
  • How can I get the package to integrate into the documentation centre, so that the documentation will be readable? (Opening the notebooks directly shows something ugly.) I don't have experience with the application structure needed for this ... – Szabolcs Nov 13 '15 at 19:40
  • I went ahead and added it to PackageData.net. Feel free to edit it; you can log in with your SE account! http://packagedata.net/index.php/#package-182 – Szabolcs Nov 13 '15 at 19:43
  • @ArnoudBuzing Please take a look at this question of mine. Also, the latest version of Chrome produces a warning about an unsupported command-line flag when it is launched via StartWebSession. Are there any updates for the package? – Alexey Popkov Jun 27 '16 at 12:27
  • @AlexeyPopkov, I replied to your question. You can ignore the unsupported command-line flag warning, I just haven't figured out how to configure Chrome to not complain about that. – Arnoud Buzing Jun 28 '16 at 16:01
  • @ArnoudBuzing I have solved the unsupported command-line flag warning problem just by downloading the latest release of ChromeDriver (currently 2.22) and placing it into the appropriate directory in place of the existing chromedriver.exe (which comes as part of the package)! – Alexey Popkov Jul 02 '16 at 13:01
  • I put this package on GitHub (see answer) and updated it to the latest version (2.25) of chromedriver. – Arnoud Buzing Oct 26 '16 at 15:02