Hey Levent,

I have a computation that exists entirely in the Query monad. I recently did some profiling and it looks like the internal Cache is blowing up in size, leading to about 95% GC use. So I'm looking for help in understanding why this behavior is occurring and how to avoid it.

The basic computation is a fold over a custom abstract syntax tree where I accumulate on SBools. I've been able to reproduce this with the following tiny program that folds a list of SBools in the query monad:
import qualified Data.SBV           as S
import qualified Data.SBV.Control   as SC
import qualified Data.SBV.Internals as SI

import Control.Monad (foldM, replicateM)
import Data.Map      (Map, size)

-- | generate an infinite list of unique strings and take n of them, dropping
-- the empty string
stringList :: Int -> [String]
stringList n = tail . take (n+1) $ concatMap (flip replicateM "abc") [0..]
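-- For reference (my own worked example, derived by hand from the definition
-- above), the generator starts out as:
--   stringList 5  ==  ["a","b","c","aa","ab"]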
-- | the test runner: takes a computation that runs in the query monad and an
-- int that dictates the size of the list of sbools
test :: ([S.SBool] -> SC.Query (Map String Bool)) -> Int -> IO (Map String Bool)
test f n = S.runSMT $
  do prop' <- S.sBools $! stringList n
     SC.query $ f prop'
-- | I fold over the list of SBools here, constraining at each accumulation;
-- this seems to blow up the internal cache severely, leading to about 95% GC
bad :: [S.SBool] -> SC.Query (Map String Bool)
bad prop' = do b <- foldM helper S.true prop'
               S.constrain b
               fmap (fmap SI.cwToBool) $ S.getModelDictionary <$> SC.getSMTResult
  -- | combine the accumulated sbool with the current sbool, constrain the
  -- conjunction, and then return it as the new accumulator
  where helper acc x = do let b = acc S.&&& x
                          S.constrain b
                          return b
-- | identical to the bad version, except that I do not constrain at each
-- accumulation
good :: [S.SBool] -> SC.Query (Map String Bool)
good prop' = do b <- foldM helper S.true prop'
                S.constrain b
                fmap (fmap SI.cwToBool) $ S.getModelDictionary <$> SC.getSMTResult
  -- | this helper is equivalent to a plain left fold with (S.&&&)
  where helper acc x = return (acc S.&&& x)
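-- Note (my own addition): since this helper performs no solver effects, the
-- foldM in 'good' collapses to a pure fold, i.e.
--   foldM helper S.true prop'  ==  return (foldl (S.&&&) S.true prop')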
main :: IO ()
main = do putStrLn "Running Good:\n"
          goodRes <- test good 1000
          putStrLn "Running Bad:\n"
          badRes  <- test bad 1000
          -- just ensuring evaluation
          print (size goodRes)
          print (size badRes)
I just commented out the lines for each test and ran with stack bench --profile cache-test --benchmark-arguments='+RTS -hc -s -RTS'; I get the following results.
for good:

Running Good:

1000
Benchmark auto: FINISH
     278,823,496 bytes allocated in the heap
      15,291,392 bytes copied during GC
       2,165,352 bytes maximum residency (7 sample(s))
         110,480 bytes maximum slop
              10 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       165 colls,   165 par    0.064s   0.017s     0.0001s    0.0010s
  Gen  1         7 colls,     6 par    0.051s   0.014s     0.0021s    0.0042s

  Parallel GC work balance: 30.47% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.004s  (  0.003s elapsed)
  MUT     time    0.332s  (  0.316s elapsed)
  GC      time    0.102s  (  0.027s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    0.013s  (  0.004s elapsed)
  EXIT    time    0.001s  (  0.000s elapsed)
  Total   time    0.452s  (  0.350s elapsed)

  Alloc rate    840,844,689 bytes per MUT second

  Productivity  73.5% of total user, 90.3% of total elapsed

gc_alloc_block_sync: 3526
whitehole_spin: 0
gen[0].sync: 1
gen[1].sync: 35

Completed 2 action(s).

with heap profile attached (image).
and for bad:

Running Bad:

1000
Benchmark auto: FINISH
   2,131,924,376 bytes allocated in the heap
   2,973,614,912 bytes copied during GC
     239,523,080 bytes maximum residency (27 sample(s))
         920,312 bytes maximum slop
             472 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1497 colls,  1497 par   48.666s  12.479s     0.0083s    0.0225s
  Gen  1        27 colls,    26 par   12.980s   3.371s     0.1248s    0.3105s

  Parallel GC work balance: 82.55% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.003s  (  0.003s elapsed)
  MUT     time    2.904s  (  2.630s elapsed)
  GC      time   53.022s  ( 13.638s elapsed)
  RP      time    0.000s  (  0.000s elapsed)
  PROF    time    8.624s  (  2.212s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   64.553s  ( 18.483s elapsed)

  Alloc rate    734,259,598 bytes per MUT second

  Productivity   4.5% of total user, 14.2% of total elapsed

gc_alloc_block_sync: 498638
whitehole_spin: 0
gen[0].sync: 2
gen[1].sync: 689287

Completed 2 action(s).

with heap profile attached (image).
Notice the discrepancy in the GC time and calculated Productivity, and the difference in the y-axis of the heap profiles. So I assume it is the case that

  (constrain $ a &&& b &&& c)  /=  (constrain (a &&& b) >> return (a &&& b) >>= constrain . (&&& c))

Any suggestions for what's happening here? I've already reduced the number of constrain calls in my computation as much as possible, but I'm dealing with about 16000 variables, which easily blows up the cache.
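To make the two shapes concrete, here's a minimal sketch of what I mean (a, b and c are hypothetical SBools, and oneShot/incremental are names I've made up, mirroring the good/bad helpers above):

-- constrain only the final conjunction, as 'good' does:
oneShot :: S.SBool -> S.SBool -> S.SBool -> SC.Query ()
oneShot a b c = S.constrain (a S.&&& b S.&&& c)

-- constrain every intermediate accumulation as well, as 'bad' does:
incremental :: S.SBool -> S.SBool -> S.SBool -> SC.Query ()
incremental a b c = do let ab = a S.&&& b
                       S.constrain ab
                       S.constrain (ab S.&&& c)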
Let me know if I can further assist in any way and thanks for the help.