Preamble
First, let's understand why we are mixing apples and oranges here, and what the problem actually is.
In Blender, images are rendered with scene-referred values, meaning the data is linear and the values are unbounded (there is no upper limit). That information needs to be processed through a color transform (set in the Color Management section) to be viewable on a display (a monitor, projector or print). Displays use a very limited range of values (black is 0 and white is 1) and require a transfer function (sometimes called a "gamma curve"). In other words, color transforms, of which Filmic is one, convert scene-referred data to display-referred.
The math used in compositing works best with linear information on an unbounded scale. In other words: scene-referred images are better suited for compositing.
Video images are usually recorded in a display-referred format: a limited range with a baked-in transfer function.
It is hard to mix scene-referred images with display-referred images. It is a lot easier to mix them once they are on a common scale.
Unfortunately, Blender's compositor is quite limited in this regard: it has no tools to route images through different color-processing paths.
A couple of options
There are a couple of ways to approach the problem.
Best case scenario: Video in Raw or Log.
This applies if the video is shot with a professional camera, one that can record raw data or use a log encoding. The advantage of such formats is that they record far more information and dynamic range, in a way that lets the images be converted to scene-referred linear values quite easily. Then all you need to do is store them as EXR files (this can be done in DaVinci Resolve or other software).
Then use the EXRs in Blender (or any other compositing app such as Nuke, Fusion or Natron) and composite CG renders and video images happily ever after.
Alternatively, you might choose to keep the images log-encoded and de-log (linearize) them in Blender or whatever other compositing software you use.
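As a sketch of what "de-logging" means: a log curve is just an invertible function, so the compositor can recover the original linear values exactly. The constants below are made up purely for illustration; every real format (S-Log3, V-Log, C-Log, etc.) publishes its own curve and constants.

```python
import numpy as np

# Hypothetical log-curve constants, for illustration only.
A, B = 0.25, 0.5   # slope and offset of the log segment

def log_encode(linear):
    """Toy log encoding: squeezes several stops of linear light into ~0..1."""
    return A * np.log10(np.asarray(linear, dtype=np.float64) + 1e-6) + B

def log_decode(encoded):
    """De-log ('linearize'): the exact inverse of log_encode."""
    return 10 ** ((np.asarray(encoded, dtype=np.float64) - B) / A) - 1e-6

scene = np.array([0.05, 0.18, 1.0, 8.0])   # scene-referred linear light
footage = log_encode(scene)                 # what the camera records
recovered = log_decode(footage)             # linearized in the compositor
print(np.allclose(recovered, scene))        # True: the log curve is invertible
```

This invertibility is the whole point of log footage: unlike a display-referred "look", nothing is clipped away, so going back to scene-referred linear is a simple, well-defined math operation.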
Video not in Log or Raw.
Most video shot with consumer cameras (DSLRs or phone cameras) is recorded in a display-referred format.
It is very hard to accurately convert such images to scene-referred values. The difficulty lies in the way the cameras convert the values from the sensor in a non-linear way: highlights are compressed and clipped, and the result is quantized. There is no easy way to undo or reverse that conversion and restore the values of the light that was registered by the sensor. The curves used to fit the light values into the display-referred container are simply not reversible.
The only feasible approach is to mix those images in the realm of display-referred imagery. So render in Blender with Filmic and a transparent background, and composite afterwards. If you do the compositing in Blender, remember to set the color management back to sRGB (also called "Standard" or "Default") so that both images are decoded and encoded using the same values.