
I want to make a picture grid of all Nobel Laureates in physics using Mathematica. Later I want to print this picture grid on a big wall!

My first thought was to exploit the powerful Wolfram free-form input, but I ran into several difficulties.

  1. Image resolution

For example:

[screenshot: free-form input result showing a laureate's picture]

The problem is that the resolution (dpi) of the image is quite low. How can I get a high-dpi image using free-form input?

  2. How to get all Nobel Laureates in physics?

I tried "all Nobel Laureates in physics" and "Nobel Laureates in physics from 1901 to 2015"; both failed.

  3. I also want the birth and death dates of each Nobel Laureate.

To sum up, I want a list of data like this:

{...,{1921,Albert Einstein,"14 March 1879","18 April 1955",image_of_Einstein},....}

With this data list, I can later create a picture grid using Mathematica commands, with each image labeled with name and year information.
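As a rough sketch of that last step (not a final solution; labelEntry, the sample entry, and the 300 dpi value are just placeholders I made up), such a list could be turned into a labeled grid and exported at a print-friendly resolution like this:

(* Sketch: each entry is assumed to have the shape
   {year, name, birthDate, deathDate, image} described above *)
labelEntry[{year_, name_, _, _, img_}] :=
  Labeled[img, Column[{ToString[year], name}, Center]];

(* Hypothetical sample entry with a placeholder image, just to show the mechanics *)
sample = {1921, "Albert Einstein", "14 March 1879", "18 April 1955",
   ConstantImage[Gray, {100, 130}]};

grid = Grid[Partition[labelEntry /@ {sample, sample, sample}, 5, 5, {1, 1}, {}]]

(* For wall printing, Export can rasterize at a higher dpi, e.g. *)
(* Export["laureates.png", grid, ImageResolution -> 300] *)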


Another possible way

If this is hard to achieve with the Wolfram knowledgebase, another approach, which I now think is probably better, is to grab the data from www.nobelprize.org, which has a full list of all Nobel Laureates in physics and all the essential information about each laureate. For example, on this page about Einstein there is even a "Prize motivation", which is also something I want.

[screenshot: the nobelprize.org facts page for Einstein]

However, again, I don't know how to use Mathematica to grab data from a web page.

matheorem

3 Answers


Here's something to get you started down the path of scraping the somewhat larger individual pictures from the Nobel website:

(* All hyperlinks on the laureates index page *)
links = Import[
   "https://www.nobelprize.org/nobel_prizes/physics/laureates/index.html?images=yes", "Hyperlinks"];

(* Keep only the individual "-facts.html" laureate pages *)
individualpagelinks =
  Select[
   links,
   StringMatchQ[
    "https://www.nobelprize.org/nobel_prizes/physics/laureates/" ~~ NumberString ~~ "/" ~~ name__ ~~ "-facts.html"]
   ];

(* Turn each facts-page URL into the corresponding "_postcard.jpg" picture URL *)
postcardpictures =
  StringCases[
     individualpagelinks,
     "https://www.nobelprize.org/nobel_prizes/physics/laureates/" ~~ year : NumberString ~~ "/" ~~ name__ ~~ "-facts.html"
      :>
      "https://www.nobelprize.org/nobel_prizes/physics/laureates/" <> year <> "/" <> name <> "_postcard.jpg"
     ] // Flatten // DeleteDuplicates;

(* Preview the first five pictures *)
Import /@ postcardpictures[[1 ;; 5]]

[sample postcard pictures]
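If you want to keep the pictures around for later use (for example, for the final print), one option, sketched here under the assumption that the postcardpictures list above is already defined, is to save them to disk with URLSave:

(* Save each of the first five postcard pictures to a local file
   named after the last part of its URL *)
URLSave[#, FileNameJoin[{$TemporaryDirectory, FileNameTake[#]}]] & /@
  postcardpictures[[1 ;; 5]]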


I found it easier to extract the rationale for the prizes from the Wikipedia table of Nobel Prize winners in physics:

wikidata = Import[
             "https://en.wikipedia.org/wiki/List_of_Nobel_laureates_in_Physics", 
             "Data"
           ];
Cases[
  wikidata,
  {year_, name_, _, rationale_}
  :>
  {year, StringDelete[rationale, {Whitespace ~~ "[" ~~ NumberString ~~ "]", "\""}]},
  Infinity
][[2 ;; -2]]

(* Out: 
{
 {1901, "in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him"}, 
 {1902, "in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena"}, 
...
}
*)

Some manual cleanup will be necessary here: the somewhat naive method I proposed is confused by nested tables...
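One possible cleanup step, not part of the answer above but sketched here under the assumption that the genuine rows always start with an integer prize year, is to keep only rows whose first element is a year between 1901 and 2015:

(* Re-extract the rows, this time requiring an integer year and a string
   rationale, then keep only plausible prize years *)
rows = Cases[
    wikidata,
    {year_Integer, _, _, rationale_String} :>
     {year, StringDelete[rationale, {Whitespace ~~ "[" ~~ NumberString ~~ "]", "\""}]},
    Infinity];
cleaned = Select[rows, 1901 <= First[#] <= 2015 &];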

MarcoB

Using jSoupLink:

<< jSoupLink`
ParseHTML[
  "https://www.nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-facts.html",
  ".laureate_info_wrapper p",
  "text"
  ] // TableForm

[output: table of the laureate's basic facts]

It is possible to be more precise:

ParseHTML[
 "https://www.nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-facts.html",
 "span[itemprop=birthDate]",
 "text"
 ]
{"14 March 1879"}

I don't intend to explain how I figured out the CSS rules, but there are a lot of things you can do quite easily with jSoupLink if you know how. You could, for example, write a script that starts from this directory of Nobel Prizes and recursively collects data from all laureates.
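Here is a minimal sketch of such a script, added for illustration and assuming jSoupLink is installed and the site layout is unchanged; it reuses the link-collection idea from MarcoB's answer to pull the birth date from the first few laureate pages:

<< jSoupLink`

(* Collect the "-facts.html" pages from the laureates index *)
factsPages = DeleteDuplicates@Select[
    Import[
     "https://www.nobelprize.org/nobel_prizes/physics/laureates/index.html?images=yes",
     "Hyperlinks"],
    StringMatchQ[
     "https://www.nobelprize.org/nobel_prizes/physics/laureates/" ~~
      NumberString ~~ "/" ~~ __ ~~ "-facts.html"]];

(* Birth date from each of the first three laureate pages *)
ParseHTML[#, "span[itemprop=birthDate]", "text"] & /@ factsPages[[1 ;; 3]]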

C. E.
  • 70,533
  • 6
  • 140
  • 264
  • Thank you very much C.E! To looking into the source of html is really helpful. I also found importing link with option "XMLObject" realizes similar effect as jSoupLink – matheorem May 13 '16 at 14:53
  • @matheorem Importing HTML as symbolic XML is the traditional approach, but parsing symbolic XML with Cases and patterns is cumbersome. CSS rules are far superior and that is what jSoupLink allows for. – C. E. May 13 '16 at 18:38
  • I regret that I didn't try "jSoupLink" last night. I was reluctant to install an external package. But I tried now, "jSoupLink" is really handy and powerful in extracting html data. Thank you very much for your package ! – matheorem May 14 '16 at 06:10
  • However, there is an issue. I tried to use ParseHTML in my getData, but found that ParallelMap is not working, though Map is working. – matheorem May 14 '16 at 06:11
  • You can try this ParallelMap[ StringSplit[ ParseHTML[#, ".laureate_info_wrapper p", "text"], {",", ":"}] &, individualpagelinks[[1 ;; 4]]] It will give errors. But Map works. It seems that something wrong with StringSplit, because if I remove StringSplit, then ParallelMap is working – matheorem May 14 '16 at 07:33
  • Hi, C.E. I am wondering if you notice my comments. I would appreciate your reply. – matheorem May 15 '16 at 13:56
  • @matheorem I noticed your second comment which seemed to indicate that it has to do with StringSplit and not the package so I didn't consider it to be my problem to solve. Also the package was released some time ago and is strictly provided "as is", I don't have any plans for an update at the moment. In general ParallelMap is not the right thing to do here. You can collect the website information much faster by using URLFetchAsynchronous and then using ParseHTMLString on the content that you've downloaded. – C. E. May 15 '16 at 19:20
  • Thank you very much C.E. I will post another post to address the ParallelMap issue. But could you please give an example on URLFetchAsynchronous and ParseHTMLString . I look into the documentation of URLFetchAsynchronous However, I just can't understand what does it say – matheorem May 16 '16 at 11:04
  • @matheorem I don't know how to explain that function better than the documentation I'm afraid. But as for the things specific to this problem, your event function should be listening for the "data" event, and it should call ParseHTMLString on that data. Try to look here for examples. – C. E. May 16 '16 at 11:23
  • Hi, C.E. I am very sorry to bother you again. But I still can't get it. For example, I modified an example in the doc like this data = {}; eventFunction[_, "data", datas_] := data = datas; URLFetchAsynchronous["https://www.nobelprize.org/nobel_prizes/physics/laureates/index.html?images=yes", eventFunction]. But I got a list of numbers, what should I do? – matheorem May 16 '16 at 14:47
  • Use FromCharacterCode. There are examples of that on here already, including my own. – C. E. May 16 '16 at 15:28
  • Thank you so much, C.E. That is an interesting post. I never thought that MMA can do such a thing. But I still got a confusion. I saw there is also URLFetch and as I understand it, the only difference bewteen URLFetch and URLFetchAsynchronous is that latter is running in background, am I right? So why use URLFetchAsynchronous? Is data = ParallelMap[URLFetch, individualpagelinks]; ParseHTMLString[#, ".laureate_info_wrapper p", "text"] & /@ data the right way that you previously suggest? – matheorem May 17 '16 at 13:33
  • @matheorem No, you still don't understand what an asynchronous program is. It's very common in programming to do certain tasks asynchronously. If you try to download four files in parallel it means you are running four kernels and downloading one file on each kernel. Each kernel does nothing while it's waiting for the file to download. If you download forty files asynchronously you are downloading forty files simultaneously on one kernel, and the kernel is free to do other stuff in the meantime. I don't know how it handles callbacks, if it can parallelize it... (to be continued) – C. E. May 17 '16 at 13:44
  • ...even if it cannot parallelize the application of callbacks efficiently and you need to do that manually, it still makes no sense to have kernels just waiting around while you are downloading the files. Download the files asynchronously to your hard drive and then go from there. But try with callbacks first. – C. E. May 17 '16 at 13:45
  • But URLFetchAsynchronous["url",func] fetches one url each time. I don't know ways other than ParallelMap[URLFetchAsynchronous[#,func]&,linklist] to make it parallel. Sorry for my poor comprehension : ) – matheorem May 17 '16 at 13:49
  • @matheorem But the program is asynchronous. That's the magic thing about it... if you write URLFetchAsynchronous["url1",func]; URLFetchAsynchronous["url2",func]; URLFetchAsynchronous["url3",func] then it starts to download three files in parallel (on one kernel). Normally code in Mathematica is synchronous, meaning it starts with the first expression, then moves on to the second, and then to the third. But URLFetchAsynchronous is an asynchronous command. It moves on to the next expression immediately. – C. E. May 17 '16 at 13:52
  • Aha, so that is the key point! So URLFetchAsynchronous[#,func]&/@linklist will actually be running parallelized, am I right? – matheorem May 17 '16 at 13:55
  • @matheorem It's not documented how it handles callbacks, if it puts them in a queue or whatever so you'll have to experiment. But yes, that's the idea. It is a much better way to parallelize the downloading of files. If you need to process with ParallelMap you can always save the downloaded text in a variable or a file (URLSaveAsynchronous) first. – C. E. May 17 '16 at 13:56
  • Thank you so much, C.E. I learned important concept from you : ) I will try it later to see if it works. – matheorem May 17 '16 at 13:59
  • Sorry, I got troubles : ( I tried Do[ Clear[eventFunction]; eventFunction[_, "data", htmlData_] := (data[i] = FromCharacterCode@ htmlData); URLFetchAsynchronous[individualpagelinks[[i]], eventFunction], {i, 1, 3}], in order to store different result in different variable. However, ?data shows that, the data is stored in data[i], why? – matheorem May 17 '16 at 14:44
  • @matheorem Use AppendTo to store results instead. The reason is that the right hand side in SetDelayed is held. – C. E. May 17 '16 at 15:54
  • Thanks. AppendTo works, and URLFetchAsynchronous is indeed faster even on a 4 core computer. result is here http://i64.tinypic.com/kcy0yd.jpg But the problem is AppendTo will not preserve the order. That's why I want to store in data[i]. On the other hand, I think it is because := is held, i can be replaced by Do index, and my former approach should be working. However it is not. – matheorem May 18 '16 at 01:08
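To summarize the comment thread above in code: the sketch below is my reconstruction, not code posted by C. E. It starts all downloads asynchronously with URLFetchAsynchronous, builds one callback per page so each result lands in the right slot (which sidesteps the data[i] ordering problem discussed above), and then parses the stored HTML with jSoupLink's ParseHTMLString. The exact form of the data handed to the callback may differ, so treat this as a starting point. It assumes the individualpagelinks list defined earlier in the thread.

<< jSoupLink`

pages = ConstantArray[None, Length[individualpagelinks]];

(* store[i] returns a callback that saves the downloaded page into slot i *)
store[i_] := Function[{obj, event, payload},
   If[event === "data",
    (* the "data" event delivers character codes; convert them to a string *)
    pages[[i]] = FromCharacterCode[Flatten[{payload}]]]];

(* Start all downloads asynchronously; they run concurrently on one kernel *)
URLFetchAsynchronous[individualpagelinks[[#]], store[#]] & /@
  Range[Length[individualpagelinks]];

(* Once the downloads have finished, parse the stored HTML: *)
(* ParseHTMLString[#, ".laureate_info_wrapper p", "text"] & /@ pages *)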

Thanks to MarcoB and C. E. From them I learned how to deal with HTML content using Mathematica.

I now summarize my final approach below (it is a bit long, so I am making it an answer).

In this approach, all the information comes from www.nobelprize.org, and the Mathematica features used are all built-in.

individualpagelinks is a list of hyperlinks to every laureate's information page (I learned this from MarcoB):

links = DeleteDuplicates@
   Import["https://www.nobelprize.org/nobel_prizes/physics/laureates/index.html?images=yes",
    "Hyperlinks"];
individualpagelinks =
  Select[links,
   StringMatchQ[
    "https://www.nobelprize.org/nobel_prizes/physics/laureates/" ~~
     NumberString ~~ "/" ~~ __ ~~ "-facts.html"]];

To fetch the essential information from the "...-facts.html" pages, the key is to import with the "XMLObject" format, like this:

Import[individualpagelinks[[1]], "XMLObject"]

To find out which expression contains the information you want, just press Ctrl+F and search the output cell; for example, searching for "birth" shows that XMLElement["span", {"itemprop" -> "birthDate"}, {"9 March 1959"}] contains the information.

Then use Cases to get all information you need.
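For example, here is a minimal sketch that extracts just the birth date (the variable name xml is mine; it holds the import shown above):

xml = Import[individualpagelinks[[1]], "XMLObject"];
Cases[xml,
 XMLElement["span", {"itemprop" -> "birthDate"}, {date_}] :> date,
 Infinity]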

In getData, the order of the information is: image, year, given name, family name, birth date, birth place, death date, death place, prize motivation, field (if it exists).

Clear[getData];
getData[link_] := Module[{data, tmp},
  data = Import[link, "XMLObject"];
  {(* postcard picture *)
   Import[StringReplace[link, "-facts.html" -> "_postcard.jpg"]],
   StringTrim /@ {
     (* year taken from the URL *)
     StringCases[link, NumberString][[1]],
     (* given and family name *)
     Cases[data,
       XMLElement["span", {"itemprop" -> "givenName"}, {x_}] -> x,
       Infinity][[1]],
     Cases[data,
       XMLElement["span", {"itemprop" -> "familyName"}, {x_}] -> x,
       Infinity][[1]],
     (* birth date and birth place *)
     Cases[data, {XMLElement["strong", {}, {"Born:"}], ___},
       Infinity][[1, 3, -1, 1]],
     StringSplit[
      Cases[data, {XMLElement["strong", {}, {"Born:"}], ___},
        Infinity][[1, -1]], ","],
     (* death date and death place, or "live" if still alive *)
     Sequence@
      If[tmp =
        Cases[data, {XMLElement["strong", {}, {"Died:"}], ___},
         Infinity]; tmp =!= {},
       {tmp[[1, 3, -1, 1]], StringSplit[tmp[[1, -1]], ","]}, "live"],
     (* prize motivation *)
     Cases[
       data, {XMLElement["strong", {}, {"Prize motivation:"}], x_} ->
        x, Infinity][[1]],
     (* field, when present *)
     If[tmp =
       Cases[data, {XMLElement["strong", {}, {"Field:"}], x_} -> x,
        Infinity]; tmp =!= {}, tmp[[1]], Nothing]}}]

labeledPicture labels each image with the year, name, and country of birth:

labeledPicture[dataEntry_] := Labeled[dataEntry[[1]],
  Column[{dataEntry[[2, 1]],
    dataEntry[[2, 2]] <> " " <> dataEntry[[2, 3]],
    " (" <> dataEntry[[2, 5, -1]] <> ")"}, Center]]

Here is an example with the 10 most recent Nobel Laureates in physics:

data = ParallelMap[getData, individualpagelinks[[1 ;; 10]]];
Grid[Partition[labeledPicture /@ data, 5, 5, {1, 1}, {}]]

This will give

[resulting labeled picture grid of the 10 laureates]

matheorem