Background

Taking an example from "Decoding GZIP encoded Body, BodyBytes (ByteArray) and BodyBytesArray from URLRead", the headers returned by URLRead are:

URLRead[
 "https://api.stackexchange.com/2.2/info?site=mathematica", "Headers"
]
{ ... 
, "content-type" -> "application/json; charset=utf-8"
, "content-encoding" -> "gzip"
, ...
}
Import @ "https://api.stackexchange.com/2.2/info?site=mathematica"

returns a JSON string which can later be passed to ImportString.
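For example, the returned string can be parsed into an Association (a minimal sketch; "RawJSON" is an assumed choice of target format):

```mathematica
(* Import decompresses the gzip body and returns the raw JSON string *)
json = Import["https://api.stackexchange.com/2.2/info?site=mathematica"];

(* parse the string into an Association of key -> value pairs *)
data = ImportString[json, "RawJSON"];
Keys[data]
```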

Problem

URLRead, however, throws a bunch of decoding errors, which suggests that it ignores the "gzip" content-encoding and goes directly to the charset specified in content-type.

URLRead[
 "https://api.stackexchange.com/2.2/info?site=mathematica", "Body"
]

Workaround

One is already shown in the linked topic:

URLRead[
    "https://api.stackexchange.com/2.2/info?site=mathematica"
  , "BodyBytes"
] // FromCharacterCode // ImportString[#, {"gzip", "RawJSON"}] &
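The workaround can be wrapped in a small helper that decompresses only when the server actually declares gzip (a sketch, not a definitive implementation; readBody is a hypothetical name, and the lookup assumes the lowercase "content-encoding" header key shown above):

```mathematica
(* hypothetical helper: decompress the body only when the response
   carries a "content-encoding" -> "gzip" header *)
readBody[url_String] := Module[{resp, enc, raw},
  resp = URLRead[url];  (* an HTTPResponse object *)
  enc  = Lookup[Association @ resp["Headers"], "content-encoding", ""];
  raw  = FromCharacterCode @ resp["BodyBytes"];
  If[StringContainsQ[enc, "gzip"],
    ImportString[raw, {"gzip", "RawJSON"}],
    ImportString[raw, "RawJSON"]
  ]
]

readBody["https://api.stackexchange.com/2.2/info?site=mathematica"]
```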

Question

Should that be the case? Is it a bug, or am I missing the purpose of URLRead's "Body" element?

URLFetch behaves the same way, so I'm surprised this wasn't asked before.

URLFetch[
    "https://api.stackexchange.com/2.2/info?site=mathematica"
  , "Content"
]

Related

Who is to blame: parsing UTF8 encoded JSON HTTPResponse fails

Kuba
    Note that Import actually does not need the "content-encoding" -> "gzip" header for recognizing gzip compressed data. You can check it with file=URLDownload["https://api.stackexchange.com/2.2/info?site=mathematica"];Import@file. The created file is a binary gzip-compressed file and Import recognizes the compression method from the first few bytes of the file, the HTTP "content-encoding" header isn't necessary at all. – Alexey Popkov Aug 29 '17 at 14:20
    @AlexeyPopkov good point, so no one cares about content-encoding :) – Kuba Aug 29 '17 at 14:21
  • I would like to understand this too (+1). We knew that Import worked, from the quoted question.. BTW, "As of Version 11, URLFetch has been superseded by URLRead and URLExecute." – rhermans Aug 29 '17 at 14:23
    @rhermans and Alexey, now it works, does it mean it was a bug? – Kuba Oct 17 '17 at 21:04
  • @Kuba I think it was a bug, since the ability to recognize gzip is of crucial importance for such a function as URLRead. As I wrote above, I would expect it to recognize gzip even without the explicit "content-encoding" -> "gzip". Probably the latter can be checked using the file: protocol. – Alexey Popkov Oct 18 '17 at 09:52
  • @AlexeyPopkov otoh URLFetch wasn't supporting it either and no one complained (out loud). – Kuba Oct 18 '17 at 10:28
  • This just indicates that people still do not use this functionality for serious projects... – Alexey Popkov Oct 18 '17 at 10:34
  • I just have checked: URLRead["file://localhost/D%3A/Temp/test", "Body"] (where test is a gzip-encoded file) returns compressed binary string, so it looks like URLRead needs explicit "content-encoding" -> "gzip" in order to function properly. BTW, I also found that in version 11.2.0 URLDownload saves already uncompressed data, while in version 11.1.1 it saved gzip-encoded data: URLDownload["https://api.stackexchange.com/2.2/info?site=mathematica"]. – Alexey Popkov Oct 18 '17 at 17:01
  • @AlexeyPopkov I'm fine with expecting the header to be there. I guess I will prepare a test suite to run on different versions which will tell us what works and what does not. I don't care if something is a bug or not, I need to know up front what will happen :) p.s. I don't know when I will find time for that but I guess it is worth doing. – Kuba Oct 19 '17 at 07:57

1 Answer

As of V11.2 the Content-Encoding specification is respected by URLRead, and e.g.

URLRead[
  "https://api.stackexchange.com/2.2/info?site=mathematica", "Body"
]

works as expected.

For earlier versions one needs to use a workaround like the one shown above.
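Code that has to run on both old and new versions could dispatch on $VersionNumber (a sketch, assuming 11.2 is the exact cutoff at which "Body" started honoring Content-Encoding):

```mathematica
url = "https://api.stackexchange.com/2.2/info?site=mathematica";

data = If[$VersionNumber >= 11.2,
  (* V11.2+: URLRead honors Content-Encoding itself *)
  ImportString[URLRead[url, "Body"], "RawJSON"],
  (* earlier versions: decompress manually, as in the workaround above *)
  URLRead[url, "BodyBytes"] // FromCharacterCode //
    ImportString[#, {"gzip", "RawJSON"}] &
]
```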

Kuba