
I'm trying to import all the tables on a website, this one in particular:

For local files that are already tabular, Import[#, "Table"] works well, but I can't find an easy way to import simple tables embedded in a website.

Karsten7
M.R.
  • At first glance, you may need to look outside of Mathematica for this one. It looks like the prices are generated with JS, onload. Selenium makes a Java package you could most likely wrap into Mathematica. – Peter Roberge Oct 05 '15 at 18:17
  • If you know the basics of CSS selectors and HTML, then you can download the website after the prices have been loaded and use jSoupLink on your downloaded files. – C. E. Oct 05 '15 at 18:34
  • @Pickett that's a good idea! – M.R. Oct 05 '15 at 20:23
  • @PeterRoberge I'm looking into other platforms... but I would think URLFetchAsynchronous[] could handle this? – M.R. Oct 05 '15 at 20:25
  • @M.R. I'm not sure; the documentation doesn't make clear whether the JS calls are virtualized. The only implementation of GUI-less browser actions I've personally seen is the HtmlUnitDriver. – Peter Roberge Oct 06 '15 at 15:31
  • @PeterRoberge There is a package called WebUnit that can do this. Without that there is no way that Mathematica can evaluate JavaScript. – C. E. Oct 06 '15 at 22:05
  • When I said "...download the website after the prices have been loaded" I meant going to the website in the browser, download it and then import it from the hard drive. This seems to me the easiest option if it's just a one time thing. – C. E. Oct 06 '15 at 22:07
  • Amazon loads a JavaScript file which contains the data in JSON format with a JavaScript 'callback'. For example, for Linux the file is http://a0.awsstatic.com/pricing/1/ec2/linux-od.min.js You could parse this file and the other files similarly to get the data. – Oct 06 '15 at 23:53

1 Answer

This is only a partial answer, because the AWS JSON turns out to be surprisingly complicated. It gets the data into Mathematica as a nested Association, but turning that into the tables shown on the page still takes more work.

(* Download the JavaScript file as a plain string *)
str = Import["http://a0.awsstatic.com/pricing/1/ec2/linux-od.min.js", "String"];
(* Keep only the object literal inside callback( ... ); *)
pos = StringPosition[str, "callback"];
obj = StringTake[str, {pos[[1, 2]] + 2, StringLength[str] - 2}];
(* Rewrite the JSON delimiters as Mathematica syntax; this also rewrites any of these characters that occur inside string values *)
json = StringReplace[obj, {":" -> "->", "{" -> "<|", "}" -> "|>", "[" -> "{", "]" -> "}"}];
json = ToExpression[json];
  • Import the JavaScript file that the webpage loads, which contains the data.
  • Extract the JSON part (dropping the JavaScript callback wrapper, comments, etc.).
  • Replace the JSON delimiters with their Mathematica equivalents, then use ToExpression to turn the string into a Mathematica Association.
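
A possible variation on the steps above, offered only as a sketch: because the keys in the payload are unquoted, the text is a JavaScript object literal rather than strict JSON, and replacing every ":" also touches colons inside string values. One alternative is to quote the keys with a regular expression and hand the result to the built-in "RawJSON" importer. The sample string below is hypothetical (its region and type values are made up to mimic the shape of the file), and the regular expression is an assumption that would still misfire on string values containing text like ",word:".

```mathematica
(* Hypothetical sample mimicking the payload; region and type values are made up *)
str = "callback({vers:0.01,config:{regions:[{region:\"us-east\",instanceTypes:[{type:\"generalCurrentGen\",sizes:[{size:\"t2.micro\",vCPU:\"1\",valueColumns:[{name:\"linux\",prices:{USD:\"0.013\"}}]}]}]}]}});";

(* Same extraction as above: drop the leading callback( and the trailing ); *)
pos = StringPosition[str, "callback"];
obj = StringTake[str, {pos[[1, 2]] + 2, StringLength[str] - 2}];

(* Quote bare keys: an identifier that follows { or , and precedes : *)
quoted = StringReplace[obj,
   RegularExpression["([{,])\\s*([A-Za-z_]\\w*)\\s*:"] -> "$1\"$2\":"];

(* Parse as strict JSON; keys come back as strings, not symbols *)
data = ImportString[quoted, "RawJSON"];

data["config"]["regions"][[1]]["instanceTypes"][[1]]["sizes"][[1]]
```

With this variant the keys are strings ("config" rather than config), so they cannot clash with symbols that already have values in the session. The rest of this answer continues with the symbol-keyed json produced by the original code.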

Now we have an association, so we can access the data hierarchically, e.g.

json[config][regions][[1]][instanceTypes][[1]][sizes][[1]]

which gives us the result:

<|size -> "t2.micro", vCPU -> "1", ECU -> "variable",
memoryGiB -> "1", storageGB -> "ebsonly",
valueColumns -> {<|name -> "linux", prices -> <|USD -> "0.013"|>|>}|>
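
Since the goal is a table, one way to continue is to walk the three nested levels and emit one flat row per size. This is only a sketch: the sample association below is hypothetical and merely copies the shape of the result above, and the region and type keys on the outer levels are assumptions about the file that are not confirmed by the output shown here. It also takes only the first entry of valueColumns.

```mathematica
(* Hypothetical sample with the same shape as json; region and type keys are assumed *)
sample = <|config -> <|regions -> {<|region -> "us-east",
       instanceTypes -> {<|type -> "generalCurrentGen",
          sizes -> {<|size -> "t2.micro", vCPU -> "1", ECU -> "variable",
             memoryGiB -> "1", storageGB -> "ebsonly",
             valueColumns -> {<|name -> "linux",
                prices -> <|USD -> "0.013"|>|>}|>}|>}|>}|>|>;

(* One row per (region, instance type, size); Flatten removes the two outer list levels *)
rows = Flatten[
   Table[
    <|"region" -> r[region], "type" -> t[type], "size" -> s[size],
     "vCPU" -> s[vCPU], "memoryGiB" -> s[memoryGiB],
     "USD" -> s[valueColumns][[1]][prices][USD]|>,
    {r, sample[config][regions]}, {t, r[instanceTypes]}, {s, t[sizes]}],
   2];

(* Display the rows as a table *)
Dataset[rows]
```

To try it on the real data, replace sample with json. A key that does not exist yields Missing["KeyAbsent", ...] in that column rather than an error, which makes wrong guesses about the key names easy to spot.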