v0.7 Fafnir

v0.7.2 (24.04.2025)

Bugfix release adopting the compile flags required by PyTorch v2.7.0, which now uses the CXX11 ABI.
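
For reference, PyTorch itself reports which C++ ABI it was built with; a quick local check:

```python
# Check which C++ ABI the installed PyTorch wheel was built with;
# the official v2.7.0 wheels report True (CXX11 ABI).
import torch

print(torch.compiled_with_cxx11_abi())
```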

Closed Issues

  • #1785 [PyTorch] Fix module for v2.7.0, which requires CXX11 ABI

v0.7.1 (07.03.2025)

Bugfix release fixing library loading issues on some systems and improving compatibility with PyTorch v2.6.0.

Closed Issues

  • #1776 [PyTorch] Unable to find shape in tensor
  • #1775 [Tungl] Prevent SOL from loading /opt/nec/veos/lib64/libtungl.so.0
  • #1768 [PyTorch] Missing torch.ops.higher_order.autograd_function_apply

v0.7.0 (28.02.2025)

Highlights

  • NEC SX-Aurora (VE) training support. Please read the NEC SX-Aurora section for device-specific options; a usage sketch follows this list.
  • Better control over model determinism and performance.
  • Performance optimizations for X86, NVIDIA and NEC SX-Aurora devices.
  • Improved Gather/Scatter implementations that now support advanced slicing modes, e.g., torch.index_put.
  • Improved automatic VDim detection.
  • Improved low precision handling (on supported devices).
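
To illustrate the new training support together with the extended Gather/Scatter handling, here is a minimal sketch, not an official example. It assumes only the public sol.optimize entry point (referenced in the issues below); the Model class and tensor shapes are illustrative, and device selection is omitted, as the actual device-specific options are described in the NEC SX-Aurora section.

```python
# Minimal sketch: compile a small PyTorch model with SOL and run one
# training step. VE device selection is omitted; see the NEC SX-Aurora
# section of the docs for the device-specific options.
import torch
import sol

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x, idx, values):
        x = self.linear(x)
        # advanced slicing mode handled by the improved Gather/Scatter
        # implementations, e.g. torch.index_put
        return x.index_put((idx,), values)

model = Model()
opt = sol.optimize(model)        # compile the model with SOL

x      = torch.randn(8, 16)
idx    = torch.tensor([0, 3, 5])
values = torch.zeros(3, 16)

out  = opt(x, idx, values)       # forward pass
loss = out.sum()
loss.backward()                  # backward pass: training is now supported
```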

Closed Issues

  • #1754 [Wrapper] Enable Non-Output Models
  • #1746 [DFP] Unable to subtract (802816 * #0 ~ 802816) from (0 + (1024 * #0 ~ 1024)) in debugMemory
  • #1744 [DFP] Faulty Scheduling
  • #1742 [DFP] Performance
  • #1741 [C-API] Remove Tungl dependency
  • #1740 [PyTorch] Wrong result torch.Tensor.scatter_add_
  • #1739 [PyTorch] Enable sol.optimize(model.member_function)
  • #1737 [TF] Fix cloning of Keras Legacy layers
  • #1736 [HLIR] Fix Pooling::mergePadding for MaxPooling
  • #1735 [NVIDIA] Add API to increase/fetch/free a Handle maintained Workspace.
  • #1733 [Keras] Add _keras_logits to SoftMax and Sigmoid
  • #1732 [CUDNN] Performance Regression in DeConv/AlexNet
  • #1731 [CUBLAS] Autotune if using TF32 is better or not and let SOL adjust the determinism accordingly!
  • #1730 [JSON] Compress DType
  • #1729 [DFP] Faulty broadcasting in Gather case
  • #1728 [CUDNN] add v9 bias/postops
  • #1726 [Compiler/DFP] add `gen::Func post` to all renderSize/Shape methods to allow easier `uniform` dtypes
  • #1723 [PyTorch] Set sol constants as register_buffer(..., persistent=False) if available
  • #1719 [DFP] Slow Gather/Scatter
  • #1717 [Toolchain] Downgrade GCC to 10 to resolve libstdc++ issues
  • #1716 [PyTorch] torch.nanmean and torch.nansum
  • #1715 [HLIR] Remove Buffer -> Buffer without any other layers in between
  • #1714 [HLIR] Remove Gather with 0-sizes indices
  • #1713 [DNNL] Manylinux2.28 compiled DNNL requires libsupc++
  • #1711 [PyTorch] wrong scatter input/indices dimensions
  • #1709 [HLIR] SegFault BatchNormUpdate when Momentum == 1.0
  • #1707 [DNN] Make the Conv+BatchNorm Inference also work on VE and NVIDIA
  • #1706 [TF] Accuracy RegNet
  • #1705 [TF] Accuracy ResNetRS
  • #1704 [Pooling] Precompute values for Pooling::initOutputDims, so we can pass them to MergePooling
  • #1703 [TF] Unable to broadcast in sol_hlir_batch_norm_update
  • #1702 [TF] Expected shape (None, 112, 112, 64) but found (Dim(#0) ~= 32, 109, 109, 64) in DenseNet121
  • #1700 [HLIR] Prevent Outputs from being used for Persistent Copies
  • #1699 [Numpy] AttributeError: `np.mat` was removed in the NumPy 2.0 release. Use `np.asmatrix` instead.
  • #1698 [ToolChain] Check why some Python packages do not get installed in Docker
  • #1697 [ToolChain] Upgrade to newer ManyLinux as LLVM-VE fails on manylinux2014 container
  • #1696 [TF] BCE nan in training bs=10
  • #1695 [TF] keras.applications.ResNet50 error while parsing
  • #1694 [TF] TFRenderer broken
  • #1693 [TF] SOL does not return identical Keras Config
  • #1691 [DFP] Wrong SIMD Loop Collapsing
  • #1689 [DFP] InitAccessor Unmergeable not correctly working
  • #1688 [PyTorch] Initializing zero strided parameters from Numpy fail
  • #1686 [HLIR] Remove AxisOp and instead encode in Dims
  • #1685 [DFP] Incomplete loop fusion in ConvNext INF BS32
  • #1682 [JsonCPP] Disable Exceptions
  • #1681 [CURAND] torch.randint changed in 2.5?
  • #1679 [VEASL] Implement MT19937
  • #1678 [DNN] Evaluate if checking for min or avg is better
  • #1677 [DFP] Split LoopStacks if temporary data cannot be kept in on-chip memory
  • #1676 [YAAL] Simplify Cluster Calling Convention
  • #1675 [YAAL] Reduce spiking memory consumption
  • #1673 [HLIR] SDPAttention add EnableGQA
  • #1672 [DFP] Error in MaxPool2d training in PyTorch
  • #1671 [DFP] Wrong Clustering
  • #1670 [DFP] WeightedPooling DeConv seems to be broken e.g. in ShuffleNet training
  • #1668 [API] Enable user specified options to be passed on from sol.optimize to sol::compiler::Compiler
  • #1667 [VE] Can't install v0.6.1 with VEDA 2.2
  • #1666 [CMake] Fix make clean that deletes output folders for docs target
  • #1665 [PyTorch] Use fx.symbolic_trace instead of JIT script if possible
  • #1664 [Profiler/Runtime] Add Profiler Annotations to Runtime Hash
  • #1663 [TF] Issue when Masking(Masking(...))
  • #1659 [TF] Wrong Gradients for ReduceMax/Min
  • #1658 [TF] Error having View before Output in training
  • #1657 [License] Fix Date Format
  • #1656 [HLIR] Consider new DType encoding
  • #1655 [ISPC] Evaluate if `bool ? 1 : 0` or `!!bool` is faster when casting bool to float
  • #1654 [HLIR] Incomplete Cluster Fusion
  • #1649 [HLIR] Add 'isRematerializable' to all deviceCopy Operations
  • #1648 [PyTorch] Reduce memory consumption in graph broken training
  • #1647 [YAAL] Fix Memory Leak of Persistent data not being freed in bwd pass
  • #1646 [PyTorch] Allow A.fwd, B.fwd, B.bwd, A.bwd execution sequences
  • #1644 [PyTorch] Test v2.5.0
  • #1642 [Sleef] Evaluate if erf(x) is faster than 1-erfc(x) and vice versa
  • #1641 [RNN] Store workspace data in inputLayout format
  • #1640 [TF] Missing 0-th output from ...
  • #1639 [RNN] Compilation error on X86 using softmax activation
  • #1638 [TF] 'KeyError' when parsing LSTM network
  • #1637 [Wrapper] "Can't find Tensor XXX in CTX" error when using models with _keras_mask attribute.
  • #1636 [Profiler] Performance callbacks added without SOL Profiler being activated
  • #1635 [TF] Remove :0 from input argument names
  • #1634 [Optimizer] add wrapper attributes to printSignature
  • #1633 [VE] generate offloading wrapper library
  • #1632 [PyTorch] Support Dict, List and Tuple inputs in Script Parser
  • #1631 [NVIDIA] Enable to compile for multiple architectures
  • #1630 [VE] Improve Training
  • #1629 [YAAL] Find a way to properly free persistent data
  • #1628 [VE] Update rematerialization of sol_ctx on device
  • #1627 [VE] try catch does not catch exception correctly
  • #1626 [DNN/GEMM] Deprecated BNI*BNO and similar calls
  • #1625 [DNN/RNN] Backport to new GEMM derive API
  • #1624 [PyTorch] register SOL automatically in PyTorch using entry points
  • #1622 [Sleef] Upgrade 3.7
  • #1617 [DFP] Make LoopStack::leafStacks a view
  • #1616 [DFP] Fix unparallelizable outer loops
  • #1613 [HLIR] Deprecate Layer::remove
  • #1612 [DFP] Improve detection for "requires64Bit" in LoopAccessor
  • #1611 [HLIR/DNN] Remove GEMM::offsetB and GEMM::offsetC
  • #1610 [DFP] Race Condition in Scatter add axis==0
  • #1609 [HLIR] wrong Gradient "as_strided"
  • #1608 [HLIR] Gradient of some Gather is again a Gather, but with reversed indices
  • #1607 [DNN] Allow BatchCount to be broadcastable
  • #1606 [HLIR] Illegal View transformation of ... in LEDModel
  • #1605 [PyTorch] Gradients of torch.scatter_reduce and scatter_add are wrong
  • #1604 [HLIR] Gradients of Gathers are incorrect
  • #1603 [HLIR] Gradients of tensor assignment are incorrect
  • #1602 [HLIR] Transform Buffer->Copy->Buffer to Buffer->Buffer
  • #1601 [Docs] Add new features to documentation
  • #1600 [Algo] Don't store algos with VDims?
  • #1599 [Profiler] Output filename gets uppercasted
  • #1598 [VDims] Testcase gemm_perf compiles every single case, although VDims is activated
  • #1595 [HLIR] Remove Axes from IndicesBase
  • #1594 [PyTorch] Linspace
  • #1593 [HLIR] Remove offset, step and loopSize. Rename data* to *
  • #1590 [NCC] Check why NCC always enables profiler
  • #1589 [FFT] Make input copy a FFTW specific transformation
  • #1588 [ProgressBar] Progress Bar is still broken
  • #1587 [DNN] Some GEMM C kernels are wrong
  • #1586 [PyTorch] Improve handling of torch.SymInt
  • #1585 [PyTorch] FX Parser can duplicate parameters as it's not checking names properly
  • #1584 [HLIR] Encode Constant value in LayerOutput::defaultValue
  • #1583 [DFP] Allow constant LoopTypes
  • #1582 [HLIR] fix Issue_1208
  • #1581 [DFP] tensor[1, 2, 3] creates no-loop stack in bwd pass
  • #1580 [DFP] Avoid double broadcast e.g. in PY_Repeat
  • #1579 [DFP] AutoBroadcast
  • #1578 [DFP] Fix Lookup Check in WhereSelect
  • #1576 [DFP] WhereSelect
  • #1575 [DNN] Enable WhereSelect to be applied to specific dimension
  • #1573 [HLIR] Why are VDims removed in transformer 'sequences'
  • #1571 [Wrapper] Handle VDims initialization in Wrapper
  • #1570 [HLIR] Deprecate Dim::setDataSize-like methods and replace with editDataSize-like methods
  • #1569 [Numpy] Upgrade to 2.0 API
  • #1568 [PyTorch] Can we use Dynamo also for static models?
  • #1567 [HLIR] Unify Gather, Scatter, Roll, BufferOffset, BufferSlice, Reverse, Tile, ...
  • #1563 [DFP] PY_Roll computes wrong gradients when multiple rolls write to the same input
  • #1562 [DFP] roll(axis=None) causes no write loops in derive
  • #1559 [Docs] Make grey separator standard on all doc pages
  • #1558 [Installer] Use Version Padding for ~=
  • #1556 [OpenSSL] Upgrade to 3.x
  • #1553 [HLIR] Merge Gather
  • #1539 [VE] remove Handle deviceLists and allocate them instead in Module
  • #1521 [DFP] T2520_D0_Output is a register but shall use accessor in ...
  • #1515 [DFP] Correctly implement Grouped Conv
  • #1513 [NVIDIA] CUDA Graphs
  • #1509 [DFP] Don't collapse gather loops
  • #1506 [DFP] Bloom accuracy on AVX512
  • #1495 [Parser] Add numpy-Advanced Indexing style
  • #1479 [AutoTuner] Cache Algo's per session if they have constraints
  • #1460 [VDims] Enable GEMM vdims for channels
  • #1449 [PyTorch] Add Loss Functions to Parser/HLIR
  • #1444 [Profiler] Add D2H and H2D memcopies
  • #1434 [CUDNN] Graph API
  • #1414 [CUBLAS] Investigate cublasLT tuning options
  • #1401 [VE] Performance
  • #1390 [HLIR] Improve VDims for Views
  • #1352 [YAAL] Improve Error Handling
  • #1237 [HLIR] Faulty Clustering
  • #1173 [DFP] remove DFP::schedule and instead use the LoopFusion structure of DFP::optimizeLoops to determine execution schedule
  • #1141 [HLIR] GEMM optimization for i == 1 || o == 1 not working for backward pass weight-style GEMM
  • #1139 [HLIR] Remove Immediate Layer Fusion in HLIR
  • #1108 [Distributed] Changes required for multi-node distributed computing
  • #928 [DFP] Narrow, Repeat, Tile: unvectorized LoopStack found
  • #808 [API] Improve Error Messages
  • #786 [HLIR] allow tensor to be casted into complex tensor
  • #767 [DFP] Transform Cores to CoresSIMD, if the sub-SIMD don't share data through a Cache
  • #298 [PyTorch] Can't use sliced Tensor assignment
  • #288 [VEDNN] add static lib for deployment
  • #234 [Jupyter] Can we signal Jupyter when SOL has crashed?
  • #168 [HLIR] Enable Replay in HLIR