How to put a side caption on a figure that is wrapped inside text?

Question

I am relatively new to LaTeX and struggle to combine different packages. I am trying to wrap a figure within my text, and I also want the caption to sit beside the figure instead of below it. My current result looks as follows; the red arrows show what I want to improve:

In order to wrap the image within the text I am using the package wrapfig as follows:

\begin{wrapfigure}{R}{0.5\textwidth}
    \vspace{-25pt}
    \centering
    \includegraphics[scale=0.41]{figures/transformer3.png}
    \caption[Illustration of multiheaded attention]{Illustration of multiheaded attention. The two highlighted attention heads have learned to associate \textit{"it"} with different parts of the sentence.}
    \label{fig:transformer3}
\end{wrapfigure}
\noindent The projections are parameter matrices $\bm{W}_{i}^{Q} \in \mathbb{R}^{d \times d_q}$, $\bm{W}_{i}^{K} \in \mathbb{R}^{d \times d_k}$, $\bm{W}_{i}^{V} \in \mathbb{R}^{d \times d_v}$ and $\bm{W}^{O} \in \mathbb{R}^{hd_v \times d}$. By applying multiple attention heads, the model is allowed to jointly attend to information at different positions within the input sequence. In figure \ref{fig:transformer3} for example, the orange attention head associates \textit{“it”} with \textit{“The animal”}, while the green attention head has learned an association to “tired”.
\subsubsection*{Outlook on the Empirical Studies}
While the U-Net and the stacked hourglass are already well established architectures in the CV domain, Transformers have been mainly applied on NLP problems so far. However, there is a strong belief within the deep learning community that Transformers may represent a suitable architecture for CV tasks as well. For this reason, the empirical study will investigate on recent approaches to apply self-attention based networks on images. The concepts will then be implemented in a neural network that will be trained on a CV task. Finally, the performance will be evaluated against models that instead rely on the U-Net and the stacked hourglass.

For side captions, I read that the package floatrow should be useful. However, when I try to combine the two packages, I get compilation errors. I also found an introductory usage here. I can reproduce that example, but I struggle to align the result properly with my text. Can anyone help me out here? Thanks a lot!

I think that, for readability, you should interrupt the main text for the whole height of the figure, which means you don't need wrapfig. — Bernard, Apr 02 '21 at 09:33
In this case I sort of agree with you. But in another section I definitely have to use it like that, because the image is so tiny. So it would basically be exactly the same setup. — spadel, Apr 02 '21 at 09:35

Zarko · Accepted Answer · 2021-04-02T10:44:46.737

Using the same base setup for both cases is fragile. I would rather use two setups: one (sidecap) for floats that interrupt the text and one (wrapfig) for figures wrapped in the text:

\documentclass{article}
\usepackage{amssymb, bm}
\usepackage[export]{adjustbox}
\usepackage{wrapfig}
\usepackage[outercaption]{sidecap}
\makeatletter
\def\SC@figure@vpos{m}
\makeatother
\usepackage{tabularx}
\usepackage[font={small, sf},labelfont=bf]{caption}
\begin{document}
\noindent The projections are parameter matrices $\bm{W}_{i}^{Q} \in \mathbb{R}^{d \times d_q}$, $\bm{W}_{i}^{K} \in \mathbb{R}^{d \times d_k}$, $\bm{W}_{i}^{V} \in \mathbb{R}^{d \times d_v}$ and $\bm{W}^{O} \in \mathbb{R}^{hd_v \times d}$. By applying multiple attention heads, the model is allowed to jointly attend to information at different positions within the input sequence.
\begin{SCfigure}[50][ht]
    \centering
    \includegraphics[scale=0.41]{example-image-duck}%{figures/transformer3.png}
    \caption[Illustration of multiheaded attention]
            {Illustration of multiheaded attention. The two highlighted attention heads have learned to associate \textit{"it"} with different parts of the sentence.}
    \label{fig:transformer3}
\end{SCfigure}
In figure \ref{fig:transformer3} for example, the orange attention head associates \textit{“it”} with \textit{“The animal”}, while the green attention head has learned an association to “tired”.
\subsubsection*{Outlook on the Empirical Studies}
\begin{wrapfigure}[5]{R}{0.65\textwidth}
\vspace{-1.75\baselineskip}
    \begin{tabularx}{\linewidth}{@{} cX @{}}
    \includegraphics[scale=0.41,valign=T]{example-image-duck}%{figures/transformer3.png}
    &
    \caption[Illustration of multiheaded attention]
            {Illustration of multiheaded attention. The two highlighted attention heads have learned to associate \textit{"it"} with different parts of the sentence.}
    \label{fig:transformer3-wrapped}% distinct label: fig:transformer3 is already used by the SCfigure above
    \end{tabularx}
    \end{wrapfigure}
While the U-Net and the stacked hourglass are already well established architectures in the CV domain, Transformers have been mainly applied on NLP problems so far. However, there is a strong belief within the deep learning community that Transformers may represent a suitable architecture for CV tasks as well. For this reason, the empirical study will investigate on recent approaches to apply self-attention based networks on images. The concepts will then be implemented in a neural network that will be trained on a CV task. Finally, the performance will be evaluated against models that instead rely on the U-Net and the stacked hourglass.
\end{document}
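
A possible variation on the wrapped figure, not part of the original answer: instead of a tabularx, the image and the caption can each go into their own minipage inside the wrapfigure. This is only a minimal sketch. The width fractions, the label name fig:wrapped-sidecap, the caption text and the lipsum filler text are assumptions chosen for illustration, and the number of wrapped lines may still need manual tuning through the optional argument of wrapfigure.

```latex
\documentclass{article}
\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage[font={small, sf},labelfont=bf]{caption}
\usepackage{lipsum}% filler text for this sketch only
\begin{document}
\begin{wrapfigure}{R}{0.6\textwidth}
    \noindent% harmless if paragraph indentation is already suppressed here
    % left box: the image, scaled to the width of its minipage
    \begin{minipage}[c]{0.55\linewidth}
        \includegraphics[width=\linewidth]{example-image}
    \end{minipage}\hfill
    % right box: caption and label, vertically centred against the image
    \begin{minipage}[c]{0.4\linewidth}
        \caption{Illustration of a side caption inside a wrapped figure.}
        \label{fig:wrapped-sidecap}% hypothetical label name
    \end{minipage}
\end{wrapfigure}
\lipsum[1-2]
\end{document}
```

Inside the wrapfigure, \linewidth equals the width passed to the environment, so the two minipages share that width and \hfill pushes the remaining space between them.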

Thanks a lot for this great answer! I will work my way through it! :) — spadel, Apr 02 '21 at 10:44

John Kormylo · Answer 2 · 2021-04-02T16:04:58.327

This shows how to do it using paracol. The only drawback is that you have to split paragraphs manually using \splitpar and \continuepar. On the other hand, paracol is far more robust than wrapfig.

\documentclass{article}
\usepackage{amssymb, bm}
\usepackage[export]{adjustbox}
\usepackage{paracol}
\usepackage[font={small, sf},labelfont=bf]{caption}
\newsavebox{\textbox}
\newcommand{\splitpar}[2][\textwidth]{% #1 = width of column (optional), #2 = rest of paragraph after split
  \unskip\strut{\parfillskip=0pt\parskip=0pt\par}%
  \global\setbox\textbox=\vbox{\hsize=#1\relax\noindent\strut #2\strut}}
\newcommand{\continuepar}{\unvbox\textbox}
\begin{document}
\setcolumnwidth{\dimexpr 0.5\textwidth-\columnsep}% second column uses remainder
\begin{paracol}{2}
\sloppy% SOP for narrow columns
\noindent The projections are parameter matrices $\bm{W}_{i}^{Q} \in \mathbb{R}^{d \times d_q}$, $\bm{W}_{i}^{K} \in \mathbb{R}^{d \times d_k}$, $\bm{W}_{i}^{V} \in \mathbb{R}^{d \times d_v}$ and $\bm{W}^{O} \in \mathbb{R}^{hd_v \times d}$. By applying multiple attention heads, the model is allowed to jointly attend to information at different positions within the input sequence.
In figure \ref{fig:transformer3} for example, the orange attention head associates \textit{“it”} with \textit{“The animal”}, while the green attention head has learned an association to “tired”.
\switchcolumn
\begin{figure}[h!]
    \includegraphics[width=\linewidth, height=4in]{example-image}
\end{figure}
\switchcolumn
\begin{figure}[h]
    \caption[Illustration of multiheaded attention]
            {Illustration of multiheaded attention. The two highlighted attention heads have learned to associate \textit{"it"} with different parts of the sentence.}
    \label{fig:transformer3}
\end{figure}
\subsubsection*{Outlook on the Empirical Studies}
While the U-Net and the stacked hourglass are already well established architectures in the CV domain, 
Transformers have been mainly appl-\splitpar{ied on NLP problems so far. However, there is a strong belief within the deep learning community that Transformers may  represent a suitable architecture for CV tasks as well. For this reason, the empirical study will investigate on recent approaches to apply self-attention based networks on images. The concepts will then be implemented in a neural network that will be trained on a CV task. Finally, the performance will be evaluated against models that instead rely on the U-Net and the stacked hourglass.}
\end{paracol}
\continuepar
\end{document}

Thanks a lot, I will definitely try this one out too! :) — spadel, Apr 02 '21 at 18:55
