Background

Taking an example from "Decoding GZIP encoded Body, BodyBytes (ByteArray) and BodyBytesArray from URLRead", the headers returned by URLRead are:

URLRead[
 "https://api.stackexchange.com/2.2/info?site=mathematica", "Headers"
]
{ ... 
, "content-type" -> "application/json; charset=utf-8"
, "content-encoding" -> "gzip"
, ...
}
Import @ "https://api.stackexchange.com/2.2/info?site=mathematica"

returns a JSON string which can later be passed to ImportString.
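For example, the returned string can be parsed into an Association (a minimal sketch; "RawJSON" is an assumed choice of target format):

```mathematica
(* Import decompresses the gzip body and returns the raw JSON string *)
json = Import["https://api.stackexchange.com/2.2/info?site=mathematica"];

(* parse the string into an Association of key -> value pairs *)
data = ImportString[json, "RawJSON"];
Keys[data]
```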

Problem

URLRead, however, throws a bunch of decoding errors, which suggests that it ignores the "gzip" content-encoding and goes directly to the charset specified in content-type.

URLRead[
 "https://api.stackexchange.com/2.2/info?site=mathematica", "Body"
]

Workaround

One is already shown in the linked topic:

URLRead[
    "https://api.stackexchange.com/2.2/info?site=mathematica"
  , "BodyBytes"
] // FromCharacterCode // ImportString[#, {"gzip", "RawJSON"}] &
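The workaround can be wrapped in a small helper that decompresses only when the server actually declares gzip (a sketch, not a definitive implementation; readBody is a hypothetical name, and the lookup assumes the lowercase "content-encoding" header key shown above):

```mathematica
(* hypothetical helper: decompress the body only when the response
   carries a "content-encoding" -> "gzip" header *)
readBody[url_String] := Module[{resp, enc, raw},
  resp = URLRead[url];  (* an HTTPResponse object *)
  enc  = Lookup[Association @ resp["Headers"], "content-encoding", ""];
  raw  = FromCharacterCode @ resp["BodyBytes"];
  If[StringContainsQ[enc, "gzip"],
    ImportString[raw, {"gzip", "RawJSON"}],
    ImportString[raw, "RawJSON"]
  ]
]

readBody["https://api.stackexchange.com/2.2/info?site=mathematica"]
```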

Question

Should that be the case? Is it a bug, or am I missing the purpose of URLRead's "Body" element?

URLFetch behaves the same way, so I'm surprised this wasn't asked before.

URLFetch[
    "https://api.stackexchange.com/2.2/info?site=mathematica"
  , "Content"
]

Related

Who is to blame: parsing UTF8 encoded JSON HTTPResponse fails

Kuba
    Note that Import actually does not need the "content-encoding" -> "gzip" header for recognizing gzip compressed data. You can check it with file=URLDownload["https://api.stackexchange.com/2.2/info?site=mathematica"];Import@file. The created file is a binary gzip-compressed file and Import recognizes the compression method from the first few bytes of the file, the HTTP "content-encoding" header isn't necessary at all. – Alexey Popkov Aug 29 '17 at 14:20
    @AlexeyPopkov good point, so no one cares about content-encoding :) – Kuba Aug 29 '17 at 14:21
  • I would like to understand this too (+1). We knew that Import worked, from the quoted question.. BTW, "As of Version 11, URLFetch has been superseded by URLRead and URLExecute." – rhermans Aug 29 '17 at 14:23
    @rhermans and Alexey, now it works, does it mean it was a bug? – Kuba Oct 17 '17 at 21:04
  • @Kuba I think it was a bug, since the ability to recognize gzip is of crucial importance for such a function as URLRead. As I wrote above, I would expect it to recognize gzip even without the explicit "content-encoding" -> "gzip". Probably the latter can be checked using the file: protocol. – Alexey Popkov Oct 18 '17 at 09:52
  • @AlexeyPopkov otoh URLFetch wasn't supporting it either and no one complained (out loud). – Kuba Oct 18 '17 at 10:28
  • This just indicates that people still do not use this functionality for serious projects... – Alexey Popkov Oct 18 '17 at 10:34
  • I just have checked: URLRead["file://localhost/D%3A/Temp/test", "Body"] (where test is a gzip-encoded file) returns compressed binary string, so it looks like URLRead needs explicit "content-encoding" -> "gzip" in order to function properly. BTW, I also found that in version 11.2.0 URLDownload saves already uncompressed data, while in version 11.1.1 it saved gzip-encoded data: URLDownload["https://api.stackexchange.com/2.2/info?site=mathematica"]. – Alexey Popkov Oct 18 '17 at 17:01
  • @AlexeyPopkov I'm fine with expecting the header to be there. I guess I will prepare a test suite to run on different versions which will tell us what works and what does not. I don't care if something is a bug or not, I need to know up front what will happen :) p.s. I don't know when I will find time for that but I guess it is worth doing. – Kuba Oct 19 '17 at 07:57

1 Answer

As of V11.2 the Content-Encoding specification is respected by URLRead, and e.g.

URLRead[
  "https://api.stackexchange.com/2.2/info?site=mathematica", "Body"
]

works as expected.

For earlier versions one needs to use a workaround like the one shown above.
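Code that has to run on both old and new versions could dispatch on $VersionNumber (a sketch, assuming 11.2 is the exact cutoff at which "Body" started honoring Content-Encoding):

```mathematica
url = "https://api.stackexchange.com/2.2/info?site=mathematica";

data = If[$VersionNumber >= 11.2,
  (* V11.2+: URLRead honors Content-Encoding itself *)
  ImportString[URLRead[url, "Body"], "RawJSON"],
  (* earlier versions: decompress manually, as in the workaround above *)
  URLRead[url, "BodyBytes"] // FromCharacterCode //
    ImportString[#, {"gzip", "RawJSON"}] &
]
```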

Kuba