-3

I have a list of strings where each element has this form:

{"created_at":"Thu Aug 08 20:53:26 +0000 2013","id":365576505679568896,"id_str":"365576505679568896","text":"Who wears it better? #TBT http:\/\/t.co\/vAXNgiRmYo","source":"web"

I am trying to extract some specific parts of each element. I prepared this function:

extract[string_] := StringCases[string, {
    "\"created_at\":\"" ~~ Shortest@x__ ~~ "\",\"id\":" :> x,
    "\",\"id\":" ~~ Shortest@b__ ~~ ",\"id_str\":\"" :> b,
    ",\"id_str\":\"" ~~ Shortest@c__ ~~ "\",\"text\":\"" :> c,
    "\",\"text\":\"" ~~ Shortest@d__ ~~ "\",\"source\":\"" :> d,
    "\",\"source\":\"" ~~ Shortest@e__ ~~ "\",\"truncated\":" :> e
   }
]

And then

extract/@listOFelements

But for the element above, for example, I get this result:

{"Thu Aug 08 20:53:26 +0000 2013", "365576505679568896", "web"}

Some parts, such as the text flanked by "\",\"text\":\"" and "\",\"source\":\"", are not extracted from the string. How can I make the function detect them?

Morry
  • Your first expression is not closed, it's missing the end part after web... – Murta Aug 09 '13 at 10:47
  • Even the first brace is a part of the string!! The string ends with no brace, as I have manipulated it before. – Morry Aug 09 '13 at 10:51
  • I can't copy the expression into MMA; it gives an error. Where is the "","truncated":" part? Can you correct it? – Murta Aug 09 '13 at 10:53
  • If the whole of your "list of strings" is one string, you don't have a list of strings. – m_goldberg Aug 09 '13 at 10:54
  • I am just copying the text directly from MMA to web! And I have a list of element which each element has the same structure as the sample above. But the information flanked by the parts in the formula are different! – Morry Aug 09 '13 at 10:59
  • If it's a big string, use \ before ", as in \" – Murta Aug 09 '13 at 11:09

2 Answers

4

You say you manipulated the string before and that's why it's missing a brace at the end. It looks like you may have deformed a JSON string, in which case you did yourself a big disservice as such lists can be imported by MMA.

Let's first repair your string:

str = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 \
2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"\
text\":\"Who wears it better? #TBT \
http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web\"";
repaired = str <> "}"

Now, import the string:

rules = ImportString[repaired, "JSON"];

Extract the information you want:

{"created_at", "id", "source"} /. rules

{"Thu Aug 08 20:53:26 +0000 2013", 365576505679568896, "web"}

JSON is a very popular data format, so you would do well to remember it and recognize it where it pops up.
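
For a whole list of tweets, the same idea might look like the sketch below. The `json` string is a made-up two-element sample (not real API output), and `tweets` and `fields` are just names chosen for this illustration.

```mathematica
(* Hypothetical sample standing in for an API response: a JSON array of two objects *)
json = "[{\"created_at\":\"Thu Aug 08 20:53:26 +0000 2013\",\"id\":1,\"text\":\"first tweet\",\"source\":\"web\"},{\"created_at\":\"Fri Aug 09 09:00:00 +0000 2013\",\"id\":2,\"text\":\"second tweet\",\"source\":\"web\"}]";

(* Each element of the array is imported as a list of rules *)
tweets = ImportString[json, "JSON"];

(* Keys to pull out of every tweet *)
fields = {"created_at", "id", "text", "source"};

(* One row of values per tweet, in the order given by fields *)
(fields /. #) & /@ tweets
```

Because the keys are looked up by name, the order of the fields inside each tweet does not matter, and no delimiter patterns are needed.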

I also note that you've asked eight questions so far and have accepted no answer for any of those questions.

C. E.
  • The reason I did not accept the solutions is that I realized none of them is a general solution, except perhaps one like the other answer to this post, which is not a consistent one!! – Morry Aug 09 '13 at 13:38
  • And I have another element which is not importable by this function!! – Morry Aug 09 '13 at 13:52
  • @Morry Perhaps the problem is the way you're asking your questions, no offense. Provide a representative sample and I will be happy to try to find a general solution. – C. E. Aug 09 '13 at 13:59
  • Please take a look at this case: https://dl.dropboxusercontent.com/u/76785824/test.nb – Morry Aug 09 '13 at 14:03
  • Maybe you are right.... You were not offensive at all. :) – Morry Aug 09 '13 at 14:03
  • @Morry I recognize that, that's from the Twitter API? The Twitter API uses JSON. Now there seems to be missing a bracket or something that causes ImportString to throw an error. Did you do anything to this string? – C. E. Aug 09 '13 at 14:04
  • If not, I would try to look at the JSON format and compare and see what differs. – C. E. Aug 09 '13 at 14:07
  • There is a huge string which is imported by xml = URLFetch[ "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_\ name=BarackObama&count=50", "OAuthAuthentication" -> token] and then is split by StringCases[xml, "{\"created_at\"" ~~ Shortest@x__ ~~ "\"}"] to make a list of Tweets. Then I face the problem in this post with the second tweet and some other! – Morry Aug 09 '13 at 14:07
  • It is the complete output https://dl.dropboxusercontent.com/u/76785824/test2.nb – Morry Aug 09 '13 at 14:12
  • Don't do any splitting, all you have to do is this: ImportString[URLFetch[...],"JSON"] and it should work. It should return a list of tweets. – C. E. Aug 09 '13 at 14:16
  • @Morry I was able to verify with your second file that ImportString[URLFetch[...],"JSON"] works, or ImportString[xml,"JSON"]. The lesson here is; if they're nice enough to give you data in an importable format, don't ruin it. – C. E. Aug 09 '13 at 14:19
  • Great! The nested ImportString works perfect! The very truth is I love People here as well Mathematica!! – Morry Aug 09 '13 at 14:22
1

Assuming that your first expression is text, I prefer the RegularExpression approach, as follows:

str = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 \
2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"\
text\":\"Who wears it better? #TBT \
http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web";

re=RegularExpression;
StringCases[str,
    {re["\"created_at\":\"(.+?)\""]-> "$1"
    ,re["\"id\":(.+?),"]-> "$1"
    ,re["\"id_str\":\"(.+?)\""]-> "$1"
    ,re["\"text\":\"(.+?)\","]-> "$1"
}
]

you get:

{"Thu Aug 08 20:53:26 +0000 2013","365576505679568896","365576505679568896","Who wears it better? #TBT http:\/\/t.co\/vAXNgiRmYo"}
Murta
  • Something to consider: (25677) – Mr.Wizard Aug 09 '13 at 12:06
  • The fact is my text is longer and there would be some duplicates within this procedure. I would like to make a more specific patterns like this: re["\"created_at\":\"(.+?)\",\"id\":"] -> "$1", re["\"id\":(.+?),\"id_str\":\""] -> "$1", re["\",\"text\":\"(.+?)\",\"source\":\""] -> "$1", re["\"id_str\":\"(.+?)\",\"text\":\""] -> "$1" where fails to make the proper result! – Morry Aug 09 '13 at 12:18
  • Can you give us a better example in your question? – Murta Aug 09 '13 at 14:05
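
A possible explanation for why the more specific patterns fail: StringCases returns non-overlapping matches by default, so once one match has consumed a trailing delimiter such as "\",\"id\":", the next pattern, which begins with that same delimiter, has nothing left to match. This fits the output in the question, where the three results are the created_at, id_str and source values. The sketch below avoids the overlap by moving each trailing delimiter into a regex lookahead, which is checked but not consumed. The name `fieldPatterns` is made up here, the sample string is the one from the question, and this has not been tried against the full API output.

```mathematica
re = RegularExpression;

(* Sample string from the question, as posted there (no closing quote or brace) *)
str = "{\"created_at\":\"Thu Aug 08 20:53:26 +0000 2013\",\"id\":365576505679568896,\"id_str\":\"365576505679568896\",\"text\":\"Who wears it better? #TBT http:\\/\\/t.co\\/vAXNgiRmYo\",\"source\":\"web";

(* (?=...) is a lookahead: the following key must be present, but it stays
   available as the start of the next match *)
fieldPatterns = {
   re["\"created_at\":\"(.+?)(?=\",\"id\":)"] -> "$1",
   re["\"id\":(.+?)(?=,\"id_str\":\")"] -> "$1",
   re["\"id_str\":\"(.+?)(?=\",\"text\":\")"] -> "$1",
   re["\"text\":\"(.+?)(?=\",\"source\":\")"] -> "$1"
   };

StringCases[str, fieldPatterns]
```

The `Overlaps -> True` option of StringCases is another route worth trying with the original string patterns, though it is not tested here. For real API output, importing the response as JSON remains the more robust choice.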