Regular Expression - for html objects

Question

I researched similar questions beforehand, but I am still confused about how to handle HTML objects.

I have the following situation: I import an HTML page from a URL and need to pick out the numbers that sit between the HTML tags:

"    Gal
            </B>
     coord. (ep=J2000) : 
           </SPAN>
          </TD>
          <TD>
           <B>
            <TT>
    123.5769 -02.1484

    (
    ~
    ) 
    [
    4.25 3.32 137
    ] 
    A

    <A HREF="http://cdsbib.u-strasbg.fr/cgi-bin/cdsbib?1997A %26 \
    A...323L..49P">1997A&A...323L..49P</A>        </TT>
           </B>
          </TD>
         </TR>
         <TR>"

So I need the coordinates, 123.5769 -02.1484, and their errors, 4.25 3.32 137.

Note that the numbers can be either negative or positive.

I will have to fetch many HTML pages and pick out the numbers that are in that same position.

Now I will explain all my steps:

First I need to get the stars from the catalogue URL, changing only the final part of the HTTP address: "=1982ApJ...263..777G", "=1978ApJ...219..504L", etc.:

dataHyperlamers = 
  Import["http://simbad.harvard.edu/simbad/sim-ref?querymethod=bib&\
simbo=on&submit=submit+bibcode&bibcode=1978ApJ...219..504L", 
    "Hyperlinks"][[27 ;; 54]];
Length[dataHyperlamers]

Then I do some "data cleaning" by fetching the content of each link:

$paralaxlamers = 
 Table[URLFetch[dataHyperlamers[[i]], "Content"], {i, 1, 
   Length[dataHyperlamers]}]

With that, I extract my first data set:

$lamers = 
      TableForm[
       StringTrim[#] & /@ 
          StringCases[#, {"<TITLE>" ~~ x__ ~~ "</TITLE></head>" -> x, 
            "<TT>" ~~ 
              x : RegularExpression[
                "\\s+\\d+(\\.\\d+)\\s+\\[\\d+(\\.\\d+)\\]"] -> 
             x}] & /@ $paralaxlamers];

And I export it as .dat:

Export["TAbCataloglamersParalax.dat", $lamers];

Now I need to do the same as above, but extract different data from the same URLs:

$lamersGal = 
     TableForm[
      StringTrim[#] & /@ 
         StringCases[#, {"<TITLE>" ~~ x__ ~~ "</TITLE></head>" -> x, 
           "Gal" ~~ 
             x : RegularExpression[
               "\\d+(\\.\\d+)\\s+\\-\\d+(\\.\\d+)\\s+\\(\\s+Optical\\s+\\)\
    \\s+\\[\\d+(\\.\\d+)\\s+\\d+(\\.\\d+)\\]\\s+"] ~~ "<A" -> 
            x}] & /@ $paralaxlamers]

But this last extraction does not work.

In the end I will export it as .dat as well, so that I have tables with the parallax and the galactic position of each star from the catalogues.

Can you post a link to a valid and complete HTML page to use as example in the answer? It's important that it should be correct HTML, not a fragment, like what you posted above. — Szabolcs, Dec 03 '14 at 18:39
It is inside the Import function: http://simbad.harvard.edu/simbad/sim-ref?querymethod=bib&simbo=on&submit=submit+bibcode&bibcode=1978ApJ...219..504L — locometro, Dec 03 '14 at 18:41
http://simbad.harvard.edu/simbad/sim-ref?querymethod=bib&simbo=on&submit=submit+bibcode&bibcode=1978ApJ...219..504L; I do not know why, but in these comments the final part is not rendered as part of the hyperlink, although it belongs to the URL. — locometro, Dec 03 '14 at 18:49
The correct final parts are "=1982ApJ...263..777G" and "=1978ApJ...219..504L". — locometro, Dec 03 '14 at 18:59

score 1 · Accepted Answer · answered Dec 03 '14 at 20:42

A very localized solution; pretend I don't know anything about HTML or how impossible it is to parse.

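   (* Fetch the raw HTML of one SIMBAD object page *)
   html = URLFetch[
      "http://simbad.cfa.harvard.edu/simbad/sim-id?Ident=%40132509&Name=*%20gam%20Cas&submit=submit",
      "Content"];
   (* Keep everything from (position of first match + off - 1) to the end of the string *)
   strtakeafterpos[string_String, pat_, off_] :=
         StringTake[ string , {StringPosition[ string , pat][[1, 1]] + off - 1 , -1} ]
   (* Keep the first (position of first match + off - 2) characters of the string *)
   strtakebeforepos[string_String, pat_, off_] :=
         StringTake[ string , StringPosition[ string , pat][[1, 1]] + off - 2 ]
   (* Cut after "Gal", then past the TT tag, stop before the next "A"; rows 2 and 8 hold the numbers *)
   ImportString[
       strtakebeforepos[strtakeafterpos[ strtakeafterpos[html, "Gal", 3]  ,
         "\<TT\>" , 4 ] , "A", 0] , "Table"][[{2, 8}]]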

{{123.577, -2.1484}, {4.25, 3.32, 137}}

This works for all of the files in your example, but of course it will break if the web developer changes anything.
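For comparison, here is a minimal sketch of a regular-expression variant, in case you prefer to stay close to the StringCases approach in the question. Judging from the fragment posted there, the original pattern cannot match because it expects digits right after "Gal" (several tags come first), a mandatory minus sign, the literal word "Optical" inside the parentheses (the fragment has "~"), and only two numbers in the brackets (the fragment has three, the last one an integer). The names sample, num, galPattern and extractGal are made up for this illustration, and sample is a shortened copy of the posted fragment rather than a full page, so treat it as untested against the live site.

```mathematica
(* Shortened copy of the fragment from the question; a real page would come from URLFetch. *)
sample = "Gal </B> coord. (ep=J2000) : </SPAN> </TD> <TD> <B> <TT>
123.5769 -02.1484
(
~
)
[
4.25 3.32 137
]
A";

(* Signed integer or decimal: optional sign, digits, optional fractional part. *)
num = "[+-]?\\d+(?:\\.\\d+)?";

(* "Gal", anything up to the next <TT>, two coordinates, an optional
   parenthesised flag of any content, then three bracketed error values. *)
galPattern = RegularExpression[
   "(?s)Gal.*?<TT>\\s*(" <> num <> ")\\s+(" <> num <>
    ")\\s*(?:\\([^)]*\\)\\s*)?\\[\\s*(" <> num <> ")\\s+(" <> num <>
    ")\\s+(" <> num <> ")\\s*\\]"];

(* Return at most one match as {{coordinates}, {errors}}, converted to numbers.
   ToExpression only ever sees text that already matched num. *)
extractGal[html_String] := Map[ToExpression,
   StringCases[html, galPattern -> {{"$1", "$2"}, {"$3", "$4", "$5"}}, 1],
   {-1}]

extractGal[sample]
```

For the fragment above this should return a single match holding the two coordinates and the three error values, and an empty list when the pattern is not found. Like the position-based code, it assumes the first "Gal" on the page is the one that precedes the galactic coordinates.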

Worked fine! And very fast — locometro, Dec 05 '14 at 00:48
