v0.7.11 (27.10.2025)
Highlights
- [Experimental] Significantly improved the VDims system to enable more dynamic models.
 
- [Experimental] Added support for `torch.compile(..., dynamic=True)` and `torch.fx` symbolic operators.

- [Experimental] Added PyTorch graph breaks, enabling `torch.distributed` calls within the model (see the sketch after this list).
 
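A minimal sketch of how these two experimental features might be combined; the backend name "sol" is an assumption (SOL registers itself with PyTorch via entry points, see #1624), and the model is purely illustrative:

```python
# Illustrative sketch, not the authoritative API: the backend name
# "sol" is assumed here.
import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        x = self.linear(x)
        # A torch.distributed call inside the model now produces a
        # graph break instead of failing the compilation.
        if torch.distributed.is_initialized():
            torch.distributed.all_reduce(x)
        return x

# dynamic=True keeps the batch dimension symbolic, so varying batch
# sizes do not trigger a recompile for every new shape.
compiled = torch.compile(Toy(), backend="sol", dynamic=True)
out = compiled(torch.randn(8, 64))
out = compiled(torch.randn(16, 64))  # different batch size, same binary
```
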
Closed Issues
- #1888	[PyTorch] Test v2.9.0
 
- #1887	[DNN] Stop GEMM autotuning if another solution was found that is significantly faster
 
- #1886	[HLIR] Don't use generator_device for hash if there are Input or Param
 
- #1885	[PyTorch] set torch._dynamo.config.allow_rnn=True
 
- #1884	[VE] Reduce malloc overhead for >4GB allocations
 
- #1883	[DNN] Don't autotune GEMM if dims are unknown
 
- #1882	[HLIR] Add VDim support to arange
 
- #1881	[PyTorch] Evaluate torch.compile(..., dynamic=True)
 
- #1880	[HLIR] Improve graph breaking and detection of identical sub-graphs
 
- #1879	[HLIR] Transform Split(Indices, 2) into 2x Indices
 
- #1878	[PyTorch] Investigate why multiple runs often lead to unstable hash generation
 
- #1877	[Runtime] Share offloaded tensors across multiple instances of runtime::Network
 
- #1876	[PyTorch] Evaluate torch._dynamo.config.capture_scalar_outputs = True
 
- #1873	[Keras] efficientnetb0 "Cannot modify VDims at this point anymore!"
 
- #1872	[TransparentOffload] Store exactly which tensors need to be copied back
 
- #1870	[DFP] WeightConv broken in VE
 
- #1869	[HLIR] Chunk can remove DimSym if the inputs get changed to static dims
 
- #1868	[VEDNN] Segfault vednnConvolutionBackwardFilter with 1x1 kernel, 3in, 3out and stride==2
 
- #1866	[VE] PyTorch accuracy
 
- #1861	[PyTorch] Gradient Accuracy
 
- #1860	[Runtime] Set host_handle/device_handle for output in NetworkRenderer and remove from module.cpp
 
- #1859	[TensorFlow] Test v2.20.0
 
- #1857	[Runtime] Simplify sol_ctx
 
- #1856	[CUDA] libcudart.so requires linking to -ldl in NVIDIA HPC CUDA Toolkit
 
- #1855	[CUDA] Problems finding CUDA toolkit with NVIDIA HPC package
 
- #1849	[PyTorch] einops.einops.reduce
 
- #1839	[PyTorch] Split torch.compile models
 
- #1765	[VE] ANEMOI using no-autotuning is faster on VE than with autotuning
 
- #1701	[HLIR/YAAL/Wrapper] Add "clone" operator.
 
- #1292	[PyTorch] torch.compile(...) does not pass models containing RNN layers to custom compilers
 
- #90	[VEDNN] possible error in conv2d_bwd_filter in SqueezeNet 1.0
 
 
v0.7.10 (01.08.2025)
Closed Issues
- #1853	[DFP] allocates too much data for intermediate tensors
 
- #1852	[VE] Report out-of-memory
 
- #1850	[DEBUG] Add stack traces to original code if available
 
 
v0.7.9 (30.07.2025)
Closed Issues
- #1846	[PyTorch] add "amin" and "amax" to torch.scatter_reduce
 
 
v0.7.8 (28.07.2025)
Highlights
- Added more PyTorch operators.

- Enabled usage of VDims within Concat and Split.

- [Experimental] VDims recommendation system that automatically sets the batch dimension as a VDim in CNNs, and the batch and sequence dimensions in RNNs (see the sketch after this list).
 
 
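A hedged sketch of what the new operators and VDims support enable; the exact `sol.optimize(...)` signature is an assumption (consult the SOL documentation for the real API), and the model is illustrative only:

```python
# Illustrative sketch; the sol.optimize(...) call is an assumption.
import torch
import sol

class Toy(torch.nn.Module):
    def forward(self, a, b):
        x = torch.cat([a, b], dim=1)                     # Concat with VDim inputs (#1809)
        lo, hi = torch.split(x, x.shape[1] // 2, dim=1)  # Split with VDims (#1840)
        # Newly supported operators (#1500, #1835):
        return torch.amax(lo, dim=1) + torch.sort(hi, dim=1).values[:, 0]

# The recommendation system can automatically mark the batch dimension
# as a VDim, so both calls below can reuse the same compiled kernel.
model = sol.optimize(Toy())
model(torch.randn(4, 8), torch.randn(4, 8))
model(torch.randn(32, 8), torch.randn(32, 8))
```
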
Closed Issues
- #1843	[PyTorch] Investigate Transformer VMap cases that don't work properly
 
- #1842	[PyTorch] torch.where(condition)
 
- #1840	[PyTorch] Improve VDims in Split
 
- #1837	[PyTorch] torch.nonzero
 
- #1836	[PyTorch] torch.unique[_consecutive]
 
- #1835	[PyTorch] torch.sort
 
- #1833	[ISPC] Issue with integer postfix
 
- #1832	[HLIR] Add new VDim Recommendation system
 
- #1809	[HLIR] Enable Concat with VDim inputs
 
- #1500	[PyTorch] torch.amax
 
 
v0.7.7 (10.07.2025)
Closed Issues
- #1830	[Keras v2] mixed_float16 issue
 
- #1827	[HLIR] "Implementation Error" in sol::compiler::hlir::functional(sol::compiler::LayerOutput*, sol::compiler::Functional::Type)
 
- #1825	[TF/CUDA] out of memory during training
 
- #1824	[CUDA] Error nvcuda::wmma::precision::tf32 on CC == 7.5
 
- #1817	[CUDNN] Add CUDNN v8 backend, to enable Conv in older TF versions
 
 
v0.7.6 (09.07.2025)
Highlights
- Added support for Keras v3, enabling TensorFlow ≥ 2.16.
 
- Improved performance when using `sol.device.set(...)` together with `torch.compile(...)` by reducing memcopy overhead (see the sketch after this list).
 
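A sketch of the combined usage this highlight refers to; the device name "ve", the backend name "sol", and the exact `sol.device.set(...)` arguments are assumptions:

```python
# Sketch only: device and backend names are assumptions.
import torch
import sol

model = torch.nn.Linear(32, 32)
sol.device.set("ve", 0)                         # transparently offload to VE device 0
compiled = torch.compile(model, backend="sol")

# Parameters now stay resident on the device between calls (#1789),
# so this loop avoids redundant host<->device memcopies.
for _ in range(3):
    out = compiled(torch.randn(8, 32))
```
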
Closed Issues
- #1823 [VE] NCC vectorized DFP for loop that is marked as "non vectorizable"
 
- #1821 [PyTorch] Wrong gradients for MaxUnPooling
 
- #1818 [HLIR] fix wrong integer postfixes
 
- #1816 [Python] Deprecated sol.backends API
 
- #1815 [TF] Problem using transparent offloading in predict or fit
 
- #1814 [Keras] stateless random layers
 
- #1813 [PyTorch] Changed random operators in v2.7
 
- #1811 [Keras v3] Keras Applications DenseNet wrong moving mean/variance in training
 
- #1810 [HLIR/DNN] Mixed Precision RNN
 
- #1808 [WHL] Broken nec-sol-dist-ve-vednn dependency
 
- #1806 [CUDNN] Training on ShuffleNet can't be executed on NVIDIA GPUs
 
- #1804 [TF] Wrong gradients in RNN
 
- #1803 [TF] Can't execute model as static vdims can't be multiplied using None * None
 
- #1802 [TF] New Keras Parser produces wrong random numbers with seed=None
 
- #1799 [Docs] Add CUDA/CUDNN requirements
 
- #1798 [SQL] TRANSACTIONS cause significant waiting times in other processes
 
- #1796 [CUBLAS] SOL DNN CUBLASlt requires at least CUDA >= 11.8
 
- #1789 [PyTorch] store torch.nn.Parameter in ctx.params if using torch.compile to reduce memcpy overhead in transparent offload
 
- #1755 [TF] TF_Activation fails because of _keras_logits
 
- #1392 [TF] Keras v3 support (TF > v2.15)
 
 
v0.7.5 (05.06.2025)

Bugfix release fixing Keras RNN parsing and adding the TF Einsum operator.
Closed Issues
- #1794 [TF] add Einsum
 
- #1793 [TF] Parsing error when Dropout -> RNN
 
- #1792 [Docs] Add iWAPT presentation
 
 
v0.7.4 (23.05.2025)

Bugfix release relaxing the dist dependency requirements.
Closed Issues
- #1787 [WHL] Relax dist dependencies to allow 1.0.post1 like bugfixes
 
 
v0.7.3 (19.05.2025)

Bugfix release adding a PadV2 handler to TensorFlow.
Closed Issues
- #1786 [TF] Missing PadV2 handler
 
 
v0.7.2 (24.04.2025)

Bugfix release adapting to the changed compile flags (CXX11 ABI) required by the PyTorch v2.7.0 integration.
Closed Issues
- #1785 [PyTorch] Fix module for v2.7.0, which requires CXX11 ABI
 
 
v0.7.1 (07.03.2025)

Bugfix release fixing library loading issues on some systems and improving compatibility with PyTorch v2.6.0.
Closed Issues
- #1776 	[PyTorch] Unable to find shape in tensor
 
- #1775 	[Tungl] Prevent SOL from loading /opt/nec/veos/lib64/libtungl.so.0
 
- #1768 	[PyTorch] Missing torch.ops.higher_order.autograd_function_apply
 
 
v0.7.0 (28.02.2025)
Highlights
- NEC SX-Aurora (VE) training support. Please read the NEC SX-Aurora section for device-specific options.
 
- Better control over model determinism and performance.
 
- Performance optimizations for X86, NVIDIA and NEC SX-Aurora devices.
 
- Improved Gather/Scatter implementations that now support advanced slicing modes, e.g., `torch.index_put` (see the sketch after this list).
- Improved automatic VDim detection.
 
- Improved low precision handling (on supported devices).
 
 
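A brief sketch of the advanced slicing modes the Gather/Scatter highlight refers to; the module and shapes are illustrative:

```python
# Illustrative model exercising NumPy-style advanced indexing, which
# previously failed to parse (see #298 and #1495).
import torch

class Toy(torch.nn.Module):
    def forward(self, x, idx, values):
        x = x.clone()
        x[idx] = values  # sliced tensor assignment, sugar for index_put_
        # Out-of-place index_put with accumulation (scatter-add semantics):
        return x.index_put((idx,), values, accumulate=True)

x = torch.zeros(10, 4)
idx = torch.tensor([1, 3, 5])
values = torch.randn(3, 4)
out = Toy()(x, idx, values)
```
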
Closed Issues
- #1754 	[Wrapper] Enable Non-Output Models
 
- #1746 	[DFP] Unable to subtract (802816 * #0 ~ 802816) from (0 + (1024 * #0 ~ 1024)) in debugMemory
 
- #1744 	[DFP] Faulty Scheduling
 
- #1742 	[DFP] Performance
 
- #1741 	[C-API] Remove Tungl dependency
 
- #1740 	[PyTorch] Wrong result torch.Tensor.scatter_add_
 
- #1739 	[PyTorch] Enable sol.optimize(model.member_function)
 
- #1737 	[TF] Fix cloning of Keras Legacy layers
 
- #1736 	[HLIR] Fix Pooling::mergePadding for MaxPooling
 
- #1735 	[NVIDIA] Add API to increase/fetch/free a Handle maintained Workspace.
 
- #1733 	[Keras] Add _keras_logits to SoftMax and Sigmoid
 
- #1732 	[CUDNN] Performance Regression in DeConv/AlexNet
 
- #1731 	[CUBLAS] Autotune if using TF32 is better or not and let SOL adjust the determinism accordingly!
 
- #1730 	[JSON] Compress DType
 
- #1729 	[DFP] Faulty broadcasting in Gather case
 
- #1728 	[CUDNN] add v9 bias/postops
 
- #1726 	[Compiler/DFP] add `gen::Func post` to all renderSize/Shape methods to allow easier `uniform` dtypes
 
- #1723 	[PyTorch] Set sol constants as register_buffer(..., persistent=False) if available
 
- #1719 	[DFP] Slow Gather/Scatter
 
- #1717 	[Toolchain] Downgrade GCC to 10 to resolve libstdc++ issues
 
- #1716 	[PyTorch] torch.nanmean and torch.nansum
 
- #1715 	[HLIR] Remove Buffer -> Buffer without any other layers in between
 
- #1714 	[HLIR] Remove Gather with 0-sizes indices
 
- #1713 	[DNNL] Manylinux2.28 compiled DNNL requires libsupc++
 
- #1711 	[PyTorch] wrong scatter input/indices dimensions
 
- #1709 	[HLIR] SegFault BatchNormUpdate when Momentum == 1.0
 
- #1707 	[DNN] Make the Conv+BatchNorm Inference also work on VE and NVIDIA
 
- #1706 	[TF] Accuracy RegNet
 
- #1705 	[TF] Accuracy ResNetRS
 
- #1704 	[Pooling] Precompute values for Pooling::initOutputDims, so we can pass them to MergePooling
 
- #1703 	[TF] Unable to broadcast in sol_hlir_batch_norm_update
 
- #1702 	[TF] Expected shape (None, 112, 112, 64) but found (Dim(#0) ~= 32, 109, 109, 64) in DenseNet121
 
- #1700 	[HLIR] Prevent Outputs from being used for Persistent Copies
 
- #1699 	[Numpy] AttributeError: `np.mat` was removed in the NumPy 2.0 release. Use `np.asmatrix` instead.
 
- #1698 	[ToolChain] Check why some Python packages do not get installed in Docker
 
- #1697 	[ToolChain] Upgrade to newer ManyLinux as LLVM-VE fails on manylinux2014 container
 
- #1696 	[TF] BCE nan in training bs=10
 
- #1695 	[TF] keras.applications.ResNet50 error while parsing
 
- #1694 	[TF] TFRenderer broken
 
- #1693 	[TF] SOL does not return identical Keras Config
 
- #1691 	[DFP] Wrong SIMD Loop Collapsing
 
- #1689 	[DFP] InitAccessor Unmergeable not correctly working
 
- #1688 	[PyTorch] Initializing zero strided parameters from Numpy fail
 
- #1686 	[HLIR] Remove AxisOp and instead encode in Dims
 
- #1685 	[DFP] Incomplete loop fusion in ConvNext INF BS32
 
- #1682 	[JsonCPP] Disable Exceptions
 
- #1681 	[CURAND] torch.randint changed in 2.5?
 
- #1679 	[VEASL] Implement MT19937
 
- #1678 	[DNN] Evaluate if checking for min or avg is better
 
- #1677 	[DFP] Split LoopStacks if temporary data cannot be kept in on-chip memory
 
- #1676 	[YAAL] Simplify Cluster Calling Convention
 
- #1675 	[YAAL] Reduce spiking memory consumption
 
- #1673 	[HLIR] SDPAttention add EnableGQA
 
- #1672 	[DFP] Error in MaxPool2d training in PyTorch
 
- #1671 	[DFP] Wrong Clustering
 
- #1670 	[DFP] WeightedPooling DeConv seems to be broken e.g. in ShuffleNet training
 
- #1668 	[API] Enable user-specified options to be passed on from sol.optimize to sol::compiler::Compiler
 
- #1667 	[VE] Can't install v0.6.1 with VEDA 2.2
 
- #1666 	[CMake] Fix make clean that deletes output folders for docs target
 
- #1665 	[PyTorch] Use fx.symbolic_trace instead of JIT script if possible
 
- #1664 	[Profiler/Runtime] Add Profiler Annotations to Runtime Hash
 
- #1663 	[TF] Issue when Masking(Masking(...))
 
- #1659 	[TF] Wrong Gradients for ReduceMax/Min
 
- #1658 	[TF] Error having View before Output in training
 
- #1657 	[License] Fix Date Format
 
- #1656 	[HLIR] Consider new DType encoding
 
- #1655 	[ISPC] Evaluate if `bool ? 1 : 0` or `!!bool` is faster when casting bool to float
 
- #1654	[HLIR] Incomplete Cluster Fusion
 
- #1649 	[HLIR] Add 'isRematerializable' to all deviceCopy Operations
 
- #1648 	[PyTorch] Reduce memory consumption in graph broken training
 
- #1647 	[YAAL] Fix Memory Leak of Persistent data not being freed in bwd pass
 
- #1646 	[PyTorch] Allow A.fwd, B.fwd, B.bwd, A.bwd execution sequences
 
- #1644 	[PyTorch] Test v2.5.0
 
- #1642 	[Sleef] Evaluate if erf(x) is faster than 1-erfc(x) and vice versa
 
- #1641 	[RNN] Store workspace data in inputLayout format
 
- #1640 	[TF] Missing 0-th output from ...
 
- #1639 	[RNN] Compilation error on X86 using softmax activation
 
- #1638 	[TF] 'KeyError' when parsing LSTM network
 
- #1637 	[Wrapper] "Can't find Tensor XXX in CTX" error when using models with _keras_mask attribute.
 
- #1636 	[Profiler] Performance callbacks added without SOL Profiler being activated
 
- #1635 	[TF] Remove :0 from input argument names
 
- #1634 	[Optimizer] add wrapper attributes to printSignature
 
- #1633 	[VE] generate offloading wrapper library
 
- #1632 	[PyTorch] Support Dict, List and Tuple inputs in Script Parser
 
- #1631 	[NVIDIA] Enable to compile for multiple architectures
 
- #1630 	[VE] Improve Training
 
- #1629 	[YAAL] Find a way to properly free persistent data
 
- #1628 	[VE] Update rematerialization of sol_ctx on device
 
- #1627 	[VE] try catch does not catch exception correctly
 
- #1626 	[DNN/GEMM] Deprecated BNI*BNO and similar calls
 
- #1625 	[DNN/RNN] Backport to new GEMM derive API
 
- #1624 	[PyTorch] register SOL automatically in PyTorch using entry points
 
- #1622 	[Sleef] Upgrade 3.7
 
- #1617 	[DFP] Make LoopStack::leafStacks a view
 
- #1616 	[DFP] Fix unparallelizable outer loops
 
- #1613 	[HLIR] Deprecate Layer::remove
 
- #1612 	[DFP] Improve detection for "requires64Bit" in LoopAccessor
 
- #1611 	[HLIR/DNN] Remove GEMM::offsetB and GEMM::offsetC
 
- #1610 	[DFP] Race Condition in Scatter add axis==0
 
- #1609 	[HLIR] wrong Gradient "as_strided"
 
- #1608	[HLIR] Gradient of some Gather is again a Gather, but with reversed indices
 
- #1607 	[DNN] Allow BatchCount to be broadcastable
 
- #1606 	[HLIR] Illegal View transformation of ... in LEDModel
 
- #1605	[PyTorch] Gradients of torch.scatter_reduce and scatter_add are wrong
 
- #1604 	[HLIR] Gradients of Gathers are incorrect
 
- #1603 	[HLIR] Gradients of tensor assignment are incorrect
 
- #1602 	[HLIR] Transform Buffer->Copy->Buffer to Buffer->Buffer
 
- #1601 	[Docs] Add new features to documentation
 
- #1600 	[Algo] Don't store algos with VDims?
 
- #1599 	[Profiler] Output filename gets uppercasted
 
- #1598 	[VDims] Testcase gemm_perf compiles every single case, although VDims is activated
 
- #1595 	[HLIR] Remove Axes from IndicesBase
 
- #1594 	[PyTorch] Linspace
 
- #1593 	[HLIR] Remove offset, step and loopSize. Rename data* to *
 
- #1590 	[NCC] Check why NCC always enables profiler
 
- #1589	[FFT] Make input copy a FFTW specific transformation
 
- #1588 	[ProgressBar] Progress Bar is still broken
 
- #1587 	[DNN] Some GEMM C kernels are wrong
 
- #1586 	[PyTorch] Improve handling of torch.SymInt
 
- #1585 	[PyTorch] FX Parser can duplicate parameters as it's not checking names properly
 
- #1584 	[HLIR] Encode Constant value in LayerOutput::defaultValue
 
- #1583 	[DFP] Allow constant LoopTypes
 
- #1582 	[HLIR] fix Issue_1208
 
- #1581 	[DFP] tensor[1, 2, 3] creates no-loop stack in bwd pass
 
- #1580 	[DFP] Avoid double broadcast e.g. in PY_Repeat
 
- #1579 	[DFP] AutoBroadcast
 
- #1578	[DFP] Fix Lookup Check in WhereSelect
 
- #1576 	[DFP] WhereSelect
 
- #1575 	[DNN] Enable WhereSelect to be applied to specific dimension
 
- #1573 	[HLIR] Why do VDims get removed in transformer 'sequences'?
 
- #1571 	[Wrapper] Handle VDims initialization in Wrapper
 
- #1570 	[HLIR] Deprecate Dim::setDataSize-like methods and replace with editDataSize-like methods
 
- #1569	[Numpy] Upgrade to 2.0 API
 
- #1568 	[PyTorch] Can we use Dynamo also for static models?
 
- #1567 	[HLIR] Unify Gather, Scatter, Roll, BufferOffset, BufferSlice, Reverse, Tile, ...
 
- #1563	[DFP] PY_Roll computes wrong gradients when multiple rolls write to the same input
 
- #1562	[DFP] roll(axis=None) causes no write loops in derive
 
- #1559 	[Docs] Make grey separator standard on all doc pages
 
- #1558 	[Installer] Use Version Padding for ~=
 
- #1556	[OpenSSL] Upgrade to 3.x
 
- #1553 	[HLIR] Merge Gather
 
- #1539	[VE] remove Handle deviceLists and allocate them instead in Module
 
- #1521 	[DFP] T2520_D0_Output is a register but shall use accessor in ...
 
- #1515 	[DFP] Correctly implement Grouped Conv
 
- #1513 	[NVIDIA] CUDA Graphs
 
- #1509 	[DFP] Don't collapse gather loops
 
- #1506 	[DFP] Bloom accuracy on AVX512
 
- #1495 	[Parser] Add numpy-Advanced Indexing style
 
- #1479 	[AutoTuner] Cache Algo's per session if they have constraints
 
- #1460 	[VDims] Enable GEMM vdims for channels
 
- #1449 	[PyTorch] Add Loss Functions to Parser/HLIR
 
- #1444 	[Profiler] Add D2H and H2D memcopies
 
- #1434	[CUDNN] Graph API
 
- #1414 	[CUBLAS] Investigate cublasLT tuning options
 
- #1401 	[VE] Performance
 
- #1390 	[HLIR] Improve VDims for Views
 
- #1352 	[YAAL] Improve Error Handling
 
- #1237 	[HLIR] Faulty Clustering
 
- #1173 	[DFP] remove DFP::schedule and instead use the LoopFusion structure of DFP::optimizeLoops to determine execution schedule
 
- #1141 	[HLIR] GEMM optimization for i == 1 || o == 1 not working for backward pass weight-style GEMM
 
- #1139 	[HLIR] Remove Immediate Layer Fusion in HLIR
 
- #1108 	[Distributed] Changes required for multi-node distributed computing
 
- #928 	[DFP] Narrow, Repeat, Tile: unvectorized LoopStack found
 
- #808 	[API] Improve Error Messages
 
- #786 	[HLIR] allow tensor to be casted into complex tensor
 
- #767 	[DFP] Transform Cores to CoresSIMD, if the sub-SIMD don't share data through a Cache
 
- #298 	[PyTorch] Can't use sliced Tensor assignment
 
- #288	[VEDNN] add static lib for deployment
 
- #234 	[Jupyter] Can we signal Jupyter when SOL has crashed?
 
- #168 	[HLIR] Enable Replay in HLIR
 
 