28.02.2025 v0.7.0
Highlights
- NEC SX-Aurora (VE) training support. Please read the NEC SX-Aurora section for device-specific options.
- Better control over model determinism and performance.
- Performance optimizations for X86, NVIDIA and NEC SX-Aurora devices.
- Improved Gather/Scatter implementations that now support advanced slicing modes, e.g., `torch.index_put` (see the sketch after this list).
- Improved automatic VDim detection.
- Improved low precision handling (on supported devices).
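
The improved Gather/Scatter support can be exercised with a short PyTorch module. A minimal sketch, assuming a standard SOL installation; the module, shapes, and the exact `sol.optimize` call are illustrative, not taken from the SOL test suite:

```python
import torch
import sol  # assumes the SOL Python package is installed

class IndexPut(torch.nn.Module):
    """Toy module using torch.index_put, one of the advanced slicing
    modes now handled by the improved Gather/Scatter implementations."""
    def forward(self, x, idx, values):
        # Writes `values` into the rows of `x` selected by `idx`;
        # accumulate=True sums values for duplicate indices.
        return torch.index_put(x, (idx,), values, accumulate=True)

# Illustrative call; check the sol.optimize documentation for the exact
# argument list of your SOL version.
model = sol.optimize(IndexPut())
out = model(torch.zeros(8, 4), torch.tensor([0, 2, 2, 5]), torch.ones(4, 4))
```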
Closed Issues
- #1754 [Wrapper] Enable Non-Output Models
- #1746 [DFP] Unable to subtract (802816 * #0 ~ 802816) from (0 + (1024 * #0 ~ 1024)) in debugMemory
- #1744 [DFP] Faulty Scheduling
- #1742 [DFP] Performance
- #1741 [C-API] Remove Tungl dependency
- #1740 [PyTorch] Wrong result for torch.Tensor.scatter_add_
- #1739 [PyTorch] Enable sol.optimize(model.member_function)
- #1737 [TF] Fix cloning of Keras Legacy layers
- #1736 [HLIR] Fix Pooling::mergePadding for MaxPooling
- #1735 [NVIDIA] Add API to increase/fetch/free a Handle-maintained Workspace.
- #1733 [Keras] Add _keras_logits to SoftMax and Sigmoid
- #1732 [CUDNN] Performance Regression in DeConv/AlexNet
- #1731 [CUBLAS] Autotune whether using TF32 is better, and let SOL adjust the determinism accordingly
- #1730 [JSON] Compress DType
- #1729 [DFP] Faulty broadcasting in Gather case
- #1728 [CUDNN] add v9 bias/postops
- #1726 [Compiler/DFP] add `gen::Func post` to all renderSize/Shape methods to allow easier `uniform` dtypes
- #1723 [PyTorch] Set sol constants as register_buffer(..., persistent=False) if available
- #1719 [DFP] Slow Gather/Scatter
- #1717 [Toolchain] Downgrade GCC to 10 to resolve libstdc++ issues
- #1716 [PyTorch] torch.nanmean and torch.nansum
- #1715 [HLIR] Remove Buffer -> Buffer without any other layers in between
- #1714 [HLIR] Remove Gather with 0-sized indices
- #1713 [DNNL] Manylinux2.28 compiled DNNL requires libsupc++
- #1711 [PyTorch] wrong scatter input/indices dimensions
- #1709 [HLIR] SegFault BatchNormUpdate when Momentum == 1.0
- #1707 [DNN] Make the Conv+BatchNorm Inference also work on VE and NVIDIA
- #1706 [TF] Accuracy RegNet
- #1705 [TF] Accuracy ResNetRS
- #1704 [Pooling] Precompute values for Pooling::initOutputDims, so we can pass them to MergePooling
- #1703 [TF] Unable to broadcast in sol_hlir_batch_norm_update
- #1702 [TF] Expected shape (None, 112, 112, 64) but found (Dim(#0) ~= 32, 109, 109, 64) in DenseNet121
- #1700 [HLIR] Prevent Outputs from being used for Persistent Copies
- #1699 [Numpy] AttributeError: `np.mat` was removed in the NumPy 2.0 release. Use `np.asmatrix` instead.
- #1698 [ToolChain] Check why some Python packages do not get installed in Docker
- #1697 [ToolChain] Upgrade to newer ManyLinux as LLVM-VE fails on manylinux2014 container
- #1696 [TF] BCE nan in training bs=10
- #1695 [TF] keras.applications.ResNet50 error while parsing
- #1694 [TF] TFRenderer broken
- #1693 [TF] SOL does not return identical Keras Config
- #1691 [DFP] Wrong SIMD Loop Collapsing
- #1689 [DFP] InitAccessor Unmergeable not correctly working
- #1688 [PyTorch] Initializing zero-strided parameters from Numpy fails
- #1686 [HLIR] Remove AxisOp and instead encode in Dims
- #1685 [DFP] Incomplete loop fusion in ConvNext INF BS32
- #1682 [JsonCPP] Disable Exceptions
- #1681 [CURAND] torch.randint changed in 2.5?
- #1679 [VEASL] Implement MT19937
- #1678 [DNN] Evaluate if checking for min or avg is better
- #1677 [DFP] Split LoopStacks if temporary data cannot be kept in on-chip memory
- #1676 [YAAL] Simplify Cluster Calling Convention
- #1675 [YAAL] Reduce spiking memory consumption
- #1673 [HLIR] SDPAttention add EnableGQA
- #1672 [DFP] Error in MaxPool2d training in PyTorch
- #1671 [DFP] Wrong Clustering
- #1670 [DFP] WeightedPooling DeConv seems to be broken e.g. in ShuffleNet training
- #1668 [API] Enable user-specified options to be passed on from sol.optimize to sol::compiler::Compiler
- #1667 [VE] Can't install v0.6.1 with VEDA 2.2
- #1666 [CMake] Fix make clean that deletes output folders for docs target
- #1665 [PyTorch] Use fx.symbolic_trace instead of JIT script if possible
- #1664 [Profiler/Runtime] Add Profiler Annotations to Runtime Hash
- #1663 [TF] Issue when Masking(Masking(...))
- #1659 [TF] Wrong Gradients for ReduceMax/Min
- #1658 [TF] Error having View before Output in training
- #1657 [License] Fix Date Format
- #1656 [HLIR] Consider new DType encoding
- #1655 [ISPC] Evaluate if `bool ? 1 : 0` or `!!bool` is faster when casting bool to float
- #1654 [HLIR] Incomplete Cluster Fusion
- #1649 [HLIR] Add 'isRematerializable' to all deviceCopy Operations
- #1648 [PyTorch] Reduce memory consumption in graph broken training
- #1647 [YAAL] Fix Memory Leak of Persistent data not being freed in bwd pass
- #1646 [PyTorch] Allow A.fwd, B.fwd, B.bwd, A.bwd execution sequences
- #1644 [PyTorch] Test v2.5.0
- #1642 [Sleef] Evaluate if erf(x) is faster than 1-erfc(x) and vice versa
- #1641 [RNN] Store workspace data in inputLayout format
- #1640 [TF] Missing 0-th output from ...
- #1639 [RNN] Compilation error on X86 using softmax activation
- #1638 [TF] 'KeyError' when parsing LSTM network
- #1637 [Wrapper] "Can't find Tensor XXX in CTX" error when using models with _keras_mask attribute.
- #1636 [Profiler] Performance callbacks added without SOL Profiler being activated
- #1635 [TF] Remove :0 from input argument names
- #1634 [Optimizer] add wrapper attributes to printSignature
- #1633 [VE] generate offloading wrapper library
- #1632 [PyTorch] Support Dict, List and Tuple inputs in Script Parser
- #1631 [NVIDIA] Enable to compile for multiple architectures
- #1630 [VE] Improve Training
- #1629 [YAAL] Find a way to properly free persistent data
- #1628 [VE] Update rematerialization of sol_ctx on device
- #1627 [VE] try catch does not catch exception correctly
- #1626 [DNN/GEMM] Deprecate BNI*BNO and similar calls
- #1625 [DNN/RNN] Backport to new GEMM derive API
- #1624 [PyTorch] register SOL automatically in PyTorch using entry points
- #1622 [Sleef] Upgrade to 3.7
- #1617 [DFP] Make LoopStack::leafStacks a view
- #1616 [DFP] Fix unparallelizable outer loops
- #1613 [HLIR] Deprecate Layer::remove
- #1612 [DFP] Improve detection for "requires64Bit" in LoopAccessor
- #1611 [HLIR/DNN] Remove GEMM::offsetB and GEMM::offsetC
- #1610 [DFP] Race Condition in Scatter add axis==0
- #1609 [HLIR] wrong Gradient "as_strided"
- #1608 [HLIR] Gradient of some Gather is again a Gather, but with reversed indices
- #1607 [DNN] Allow BatchCount to be broadcastable
- #1606 [HLIR] Illegal View transformation of ... in LEDModel
- #1605 [PyTorch] Gradients of torch.scatter_reduce and scatter_add are wrong
- #1604 [HLIR] Gradients of Gathers are incorrect
- #1603 [HLIR] Gradients of tensor assignment are incorrect
- #1602 [HLIR] Transform Buffer->Copy->Buffer to Buffer->Buffer
- #1601 [Docs] Add new features to documentation
- #1600 [Algo] Don't store algos with VDims?
- #1599 [Profiler] Output filename gets uppercased
- #1598 [VDims] Testcase gemm_perf compiles every single case, although VDims is activated
- #1595 [HLIR] Remove Axes from IndicesBase
- #1594 [PyTorch] Linspace
- #1593 [HLIR] Remove offset, step and loopSize. Rename data* to *
- #1590 [NCC] Check why NCC always enables profiler
- #1589 [FFT] Make input copy a FFTW specific transformation
- #1588 [ProgressBar] Progress Bar is still broken
- #1587 [DNN] Some GEMM C kernels are wrong
- #1586 [PyTorch] Improve handling of torch.SymInt
- #1585 [PyTorch] FX Parser can duplicate parameters as it's not checking names properly
- #1584 [HLIR] Encode Constant value in LayerOutput::defaultValue
- #1583 [DFP] Allow constant LoopTypes
- #1582 [HLIR] fix Issue_1208
- #1581 [DFP] tensor[1, 2, 3] creates no-loop stack in bwd pass
- #1580 [DFP] Avoid double broadcast e.g. in PY_Repeat
- #1579 [DFP] AutoBroadcast
- #1578 [DFP] Fix Lookup Check in WhereSelect
- #1576 [DFP] WhereSelect
- #1575 [DNN] Enable WhereSelect to be applied to specific dimension
- #1573 [HLIR] Why are VDims removed in transformer 'sequences'?
- #1571 [Wrapper] Handle VDims initialization in Wrapper
- #1570 [HLIR] Deprecate Dim::setDataSize-like methods and replace with editDataSize-like methods
- #1569 [Numpy] Upgrade to 2.0 API
- #1568 [PyTorch] Can we use Dynamo also for static models?
- #1567 [HLIR] Unify Gather, Scatter, Roll, BufferOffset, BufferSlice, Reverse, Tile, ...
- #1563 [DFP] PY_Roll computes wrong gradients when multiple rolls write to the same input
- #1562 [DFP] roll(axis=None) causes no write loops in derive
- #1559 [Docs] Make grey separator standard on all doc pages
- #1558 [Installer] Use Version Padding for ~=
- #1556 [OpenSSL] Upgrade to 3.x
- #1553 [HLIR] Merge Gather
- #1539 [VE] remove Handle deviceLists and allocate them instead in Module
- #1521 [DFP] T2520_D0_Output is a register but shall use accessor in ...
- #1515 [DFP] Correctly implement Grouped Conv
- #1513 [NVIDIA] CUDA Graphs
- #1509 [DFP] Don't collapse gather loops
- #1506 [DFP] Bloom accuracy on AVX512
- #1495 [Parser] Add NumPy-style advanced indexing
- #1479 [AutoTuner] Cache Algos per session if they have constraints
- #1460 [VDims] Enable GEMM vdims for channels
- #1449 [PyTorch] Add Loss Functions to Parser/HLIR
- #1444 [Profiler] Add D2H and H2D memcopies
- #1434 [CUDNN] Graph API
- #1414 [CUBLAS] Investigate cublasLT tuning options
- #1401 [VE] Performance
- #1390 [HLIR] Improve VDims for Views
- #1352 [YAAL] Improve Error Handling
- #1237 [HLIR] Faulty Clustering
- #1173 [DFP] remove DFP::schedule and instead use the LoopFusion structure of DFP::optimizeLoops to determine execution schedule
- #1141 [HLIR] GEMM optimization for i == 1 || o == 1 not working for backward pass weight-style GEMM
- #1139 [HLIR] Remove Immediate Layer Fusion in HLIR
- #1108 [Distributed] Changes required for multi-node distributed computing
- #928 [DFP] Narrow, Repeat, Tile: unvectorized LoopStack found
- #808 [API] Improve Error Messages
- #786 [HLIR] allow tensor to be cast into complex tensor
- #767 [DFP] Transform Cores to CoresSIMD, if the sub-SIMD don't share data through a Cache
- #298 [PyTorch] Can't use sliced Tensor assignment (see the example after this list)
- #288 [VEDNN] add static lib for deployment
- #234 [Jupyter] Can we signal Jupyter when SOL has crashed?
- #168 [HLIR] Enable Replay in HLIR
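
Among the long-standing items, #298 enables sliced Tensor assignment inside optimized models. A minimal sketch, again assuming a standard SOL installation; the module is illustrative and the `sol.optimize` call is simplified:

```python
import torch
import sol  # assumes the SOL Python package is installed

class SlicedAssign(torch.nn.Module):
    """Toy module using sliced tensor assignment (#298)."""
    def forward(self, x):
        y = torch.zeros_like(x)
        # In-place assignment into a slice; this previously failed to parse.
        y[:, 1:-1] = x[:, 1:-1] * 2.0
        return y

# Simplified call; check the sol.optimize documentation for the exact
# argument list of your SOL version.
model = sol.optimize(SlicedAssign())
print(model(torch.randn(4, 16)))
```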