Giant-Scale Matrix Factorization on TPUs



Matrix factorization is without doubt one of the oldest, but nonetheless broadly used, strategies for studying methods to advocate gadgets reminiscent of songs or films from person rankings. In its primary type, it approximates a big, sparse (i.e., principally empty) matrix of user-item interactions with a product of two smaller, denser matrices representing discovered merchandise and person options. These dense matrices, in flip, can be utilized to advocate gadgets to a person with which they have not interacted earlier than.

Regardless of its algorithmic simplicity, matrix factorization can nonetheless obtain aggressive efficiency in recommender benchmarks. Alternating least squares (ALS), and particularly its implicit variation, is a elementary algorithm to be taught the parameters of matrix factorization. ALS is understood for its excessive effectivity as a result of it scales linearly within the variety of rows, columns and non-zeros. Therefore, this algorithm may be very effectively fitted to large-scale challenges. However, for very massive real-world matrix factorization datasets, a single machine implementation wouldn’t suffice, and so, it will require a big distributed system. A lot of the distributed implementations of matrix factorization that make use of ALS leverage off-the-shelf CPU units, and rightfully so, because of the inherently sparse nature of the issue (the enter matrix is usually empty).

Then again, current success of deep studying, which has exhibited rising computational capability, has spurred a brand new wave of analysis and progress on {hardware} accelerators reminiscent of Tensor Processing Items (TPUs). TPUs afford area particular {hardware} speedups, particularly to be used circumstances like deep studying, which entails a lot of dense matrix multiplications. Specifically, they permit important speedups for conventional data-parallel workloads, reminiscent of coaching fashions with Stochastic Gradient Descent (SGD) in SPMD (single program a number of knowledge) trend. The SPMD strategy has gained reputation in computations like coaching neural networks with gradient descent algorithms, and can be utilized for each data-parallel and model-parallel computations, the place we distribute parameters of the mannequin throughout out there units. However, whereas TPUs have been enormously engaging for strategies based mostly on SGD, it isn’t instantly clear if a excessive efficiency implementation of ALS, which requires a lot of distributed sparse matrix multiplies, might be developed for a large-scale cluster of TPU units.

In “ALX: Giant Scale Matrix Factorization on TPUs”, we discover a distributed ALS design that makes environment friendly use of the TPU structure and may scale effectively to matrix factorization issues of the order of billions of rows and columns by scaling the variety of out there TPU cores. The strategy we suggest leverages a mix of mannequin and knowledge parallelism, the place every TPU core each shops a portion of the embedding desk and trains over a singular slice of information, grouped in mini-batches. To be able to spur future analysis on large-scale matrix factorization strategies and for instance the scalability properties of our personal implementation, we additionally constructed and launched an actual world internet hyperlink prediction dataset referred to as WebGraph.

The determine reveals the move of information and computation by the ALX framework on TPU units. Much like SGD-based coaching procedures, every TPU core performs similar computation for its personal batch of information in SPMD trend, which permits for synchronous computation in parallel on a number of TPU cores. Every TPU begins with gathering all of the related merchandise embeddings within the Sharded Collect stage. These materialized embeddings are used to resolve for person embeddings that are scattered to the related shard of the embedding desk within the Sharded Scatter stage.

Dense Batching for Improved Effectivity
We designed ALX particularly for TPUs, exploiting distinctive properties of TPU structure whereas overcoming a couple of attention-grabbing limitations. As an illustration, every TPU core has restricted reminiscence and restricts all tensors to have a static form, however every instance in a mini-batch can have a wildly various variety of gadgets (i.e., inputs might be lengthy and sparse). To resolve this, we break exceedingly lengthy examples into a number of smaller examples of the identical form, a course of referred to as dense batching. Extra particulars about dense batching might be present in our paper.

Illustrating instance of how sparse batches are densified to extend effectivity on TPUs.

Uniform Sharding of Embedding Tables
With the batching downside solved, we subsequent wish to factorize a sparse matrix into two dense embedding matrices (e.g., person and merchandise embeddings) such that the ensuing dot product of embeddings approximate the unique sparse matrix — this helps us infer predictions for all the positions from the unique matrix, together with people who had been empty, which can be utilized to advocate gadgets with which customers haven’t interacted. Each the ensuing embedding tables (W and H within the determine beneath) can probably be too massive to slot in a single TPU core, thus requiring a distributed coaching setup for many large-scale use circumstances.

Most earlier makes an attempt of distributed matrix factorization use a parameter server structure the place the mannequin parameters are saved on extremely out there servers, and the coaching knowledge is processed in parallel by staff which can be solely chargeable for the training job. In our case, since every TPU core has similar compute and reminiscence, it is wasteful to solely use both reminiscence for storing mannequin parameters or compute for coaching. Thus, we designed our system such that every core is used to do each.

Illustrative instance of factorizing a sparse matrix Y into two dense embedding matrices W and H.

In ALX, we uniformly divide each embedding tables, thus absolutely exploiting each the scale of distributed reminiscence out there and the devoted low-latency interconnects between TPUs. That is extremely environment friendly for very massive embedding tables and ends in good efficiency for distributed collect and scatter operations.

Uniform sharding of each embedding tables (W and H) throughout TPU cores (in blue).

Since potential purposes might contain very massive knowledge units, scalability is probably an necessary alternative for development in matrix factorization. To that finish, we additionally launch a big real-world internet hyperlink prediction dataset referred to as WebGraph. This dataset might be simply modeled as a matrix factorization downside the place rows and columns are supply and vacation spot hyperlinks, respectively, and the duty is to foretell vacation spot hyperlinks from every supply hyperlink. We use WebGraph for instance the scaling properties of ALX.

The WebGraph dataset was generated from a single crawl carried out by CommonCrawl in 2021 the place we strip the whole lot and preserve solely the link->outlinks knowledge. For the reason that efficiency of a factorization methodology is dependent upon the properties of the underlying graph, we created six variations of WebGraph, every various within the sparsity sample and locale, to check how effectively ALS performs on every.

  • To review locale-specific graphs, we filter based mostly on two high degree domains: ‘de’ and ‘in’, every producing a graph with an order of magnitude fewer nodes.
  • These graphs can nonetheless have arbitrary sparsity patterns and dangling hyperlinks. Thus we additional filter the nodes in every graph to have a minimal of both 10 or 50 inlinks and outlinks.

For simple entry, we have now made these out there as a Tensorflow Dataset bundle. For reference, the largest model, WebGraph-sparse, has greater than 365M nodes and 30B edges. We create and publish each coaching and testing splits for analysis functions.

We fastidiously tune the system and high quality parameters of ALX. Primarily based on our observations associated to precision and selection of linear solvers. ​​We noticed that by fastidiously deciding on the precision for storage of the embedding tables (bfloat16) and for the enter to the linear solvers (float32), we had been in a position to halve the reminiscence required for the embeddings whereas nonetheless avoiding issues arising from decrease precision values throughout the resolve stage. For our linear solvers, we chosen conjugate gradients, which we discovered to be the quickest throughout the board on TPUs. We use embeddings of dimension 128 and prepare the mannequin for 16 epochs. In our expertise, hyperparameter tuning over each norm penalty (λ) and unobserved weight (α) has been indispensable for good recall metrics as proven within the desk beneath.

Outcomes obtained by working ALX on all variations of WebGraph dataset. Recall values of 1.0 denote excellent recall.

Scaling Evaluation
For the reason that enter knowledge are processed in parallel throughout TPU cores, growing the variety of cores decreases coaching time, ideally in a linear trend. However on the similar time, a bigger variety of cores requires extra community communication (because of the sharded embedding tables). Due to high-speed interconnects, this overhead might be negligible for a small variety of cores, however because the variety of cores will increase, the overhead ultimately slows down the perfect linear scaling.

To be able to affirm our speculation, we analyze scaling properties of the 4 greatest WebGraph variants when it comes to coaching time as we enhance the variety of out there TPU cores. As proven beneath, even empirically, we do observe the anticipated linear lower in coaching time as much as a candy spot, after which the community overhead slows the decline.

Scaling evaluation of working time because the variety of TPU cores are elevated. Every determine plots the time taken to coach for one epoch in seconds.

For simple entry and reproducibility, the ALX code is open-sourced and might be simply run on Google Cloud. The truth is, we illustrate {that a} sparse matrix like WebGraph-dense of measurement 135M x 135M (with 22B edges) might be factorized in a colab related to eight TPU cores in lower than a day. We now have designed the ALX framework with scalability in thoughts. With 256 TPU cores, one epoch of the biggest WebGraph variant, WebGraph-sparse (365M x 365M sparse matrix) takes round 20 minutes to complete (5.5 hours for the entire coaching run). The ultimate mannequin has round 100B parameters. We hope that the ALX and WebGraph will probably be helpful to each researchers and practitioners working in these fields. The code for ALX might be discovered right here on github!

The core staff contains Steffen Rendle, Walid Krichene and Li Zhang. We thank many Google colleagues for serving to at varied levels of this mission. Specifically, we’re grateful to the JAX staff for quite a few discussions, particularly James Bradbury and Skye Wanderman-Milne; Blake Hechtman for assist with XLA and Rasmus Larsen for helpful discussions about efficiency of linear solvers on TPUs. Lastly, we’re additionally grateful to Nicolas Mayoraz and John Anderson for offering helpful suggestions.