Skip to main content

Jehan-Francois Paris

University of Houston, ComputerScience, Faculty Member

Followers

72

Following

22

Co-authors

12

Public Views

Address: Department of Computer Science
Mail Code 3010
Houston, TX 77204

less

Darrell D E Long

University of California, Santa Cruz

Thomas J E Schwarz

Marquette University

University of California, Santa Cruz

University of Neuchâtel

Veronica Estrada-Galinanes

University of Neuchâtel

InterestsView All (7)

Uploads

Papers by Jehan-Francois Paris

Using storage class memories to increase the reliability of two-dimensional RAID arrays

by Ahmed Amer and Jehan-Francois Paris

Two-dimensional RAID arrays maintain separate row and column parities for all their disks. Depend... more Two-dimensional RAID arrays maintain separate row and column parities for all their disks. Depending on their organization, they can tolerate between two and three concurrent disk failures without losing any data. We propose to enhance the robustness of these arrays by replacing a small fraction of these drives with storage class memory devices, and demonstrate how such a pairing is

Protecting Data against Consecutive Disk Failures in RAID-5

2009 Mexican International Conference on Computer Science, 2009

"An Analytical Study of Strategy-Griented Restructuring Algorithms

Considerable expertr::lental eVidence has been accumulated showing that the performance ot progra... more Considerable expertr::lental eVidence has been accumulated showing that the performance ot programs in virtual memory environments can be signifi· cantly Improved by restT".J.cturing the programs. I.e. by modifying their block-la-page or block-ta-segment mapping. This evidence also points aut that the so-caned strategy-o'r'iented algorithms. which base their decisions on the lmoW"ledge of the memory management strategy under which the pro· gram will run, are more efficient than those algorithms whicr ~ do not take this strategy into account. We present here some thecretical argwnents to explain Why strateg~' oriented algorithms perform better than other program restructuring algo-rithms and deterI!1ine the conditions under which these algorithms are op· timUI:::1. In particular. we prove that the algorith.."11s oriented towacds the working set or sampled working set policy are optimum when applied to pro-grams haVing no more than two blocks per page. an...

Scalable approaches for tree-based reliable multicast

Tree-based reliable multicast protocols provide scalability by distributing error-recovery tasks ... more Tree-based reliable multicast protocols provide scalability by distributing error-recovery tasks among several repair nodes. These repair nodes keep in their buffers all recently received packets and perform error recovery for their receiver nodes. This work addresses two open issues in tree-based protocols. The first is how to construct a logical tree in an efficient manner. We propose three efficient hybrid schemes for constructing a well-organized logical tree with reasonable message and time overhead. The second open issue is when to discard packets from the buffers of repair nodes. Discarding packets that might still be needed is unacceptable, because it would force the receiver nodes to contact the sender node whenever one of them needs a retransmission of a discarded packet. Schemes addressing this issue can be broadly divided into ACK-based and NAK-based schemes. ACK-based schemes require each repair node to receive one acknowledgement from each of its receiver node for each...

Andwidth a Llocation I Ssues in N Ear V Ideo on D Emand S Ervices

One way to reduce the cost of VOD is to schedule repeated broadcasts of the videos that are likel... more One way to reduce the cost of VOD is to schedule repeated broadcasts of the videos that are likely to be watched by many viewers rather than waiting for individual requests. This technique is known as video broadcasting. The savings that can be achieved are considerable, as it is often the case that 40 percent of the demand is for a small number, say, 10 to 20, of hot videos [3]. Depending on the frequency at which these videos are rebroadcast, customers may have to wait between a few minutes to, say, half an hour before watching the video of their choice. Hence this type of service can either be referred to as near video on demand (NVOD) or enhanced pay per view (EPPV).

Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments

2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018

Data centres that use consumer-grade disks drives and distributed peer-to-peer systems are unreli... more Data centres that use consumer-grade disks drives and distributed peer-to-peer systems are unreliable environments to archive data without enough redundancy. Most redundancy schemes are not completely effective for providing high availability, durability and integrity in the long-term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large scale storage system. Our motivation is to design flexible and practical erasure codes with high faulttolerance to improve data durability and availability even in catastrophic scenarios. By "flexible and practical", we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the possible paths to recover data exponentially. Two other parameters increase fault-tolerance even further without the need of additional storage. As a result, an entangled storage system can provide high availability, durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical codes. Remarkably, they excel at code locality, hence, they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios.

Merkle Hash Grids Instead of Merkle Trees

2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)

Merkle grids are a new data organization that replicates the functionality of Merkle trees while ... more Merkle grids are a new data organization that replicates the functionality of Merkle trees while reducing their transmission and storage costs by up to 50 percent. All Merkle grids organize the objects whose conformity they monitor in a square array. They add row and column hashes to it such that (a) all row hashes contain the hash of the concatenation of the hashes of all the objects in their respective row and (b) all column hashes contain the hash of the concatenation of the hashes of all the objects in their respective column. In addition, a single signed master hash contains the hash of the concatenation of all row and column hashes. Extended Merkle grids add two auxiliary Merkle trees to speed up searches among both row hashes and column hashes. While both basic and extended Merkle grids perform authentication of all blocks better than Merkle trees, only extended Merkle grids can locate individual non-conforming objects or authenticate a single non-conforming object as fast as Merkle trees.

A proactive implementation of interactive video-on-demand

Conference Proceedings of the 2003 IEEE International

Most broadcasting protocols for video-on-demand do not allow the customer to pause, move fast-for... more Most broadcasting protocols for video-on-demand do not allow the customer to pause, move fast-forward or backward while watching a video. We propose a broadcasting protocol implementing these features in a purely proactive fashion. 12 Our protocol implements rewind and pause interactions at the set-top box level by requiring the set-top box to keep in its buffer all video data it has received from the server until the customer has finished watching the video. It implements fast-forward by letting the video server transmit video data more frequently than needed by customers watching the video in sequence. As a result, any customer having watched the first x minutes of a video will be able to fast-forward to any scene within the first 2x or 3x minutes of the video. We show that this expanding horizon feature can be provided at a reasonable cost. We also show how our protocol can accommodate customers connected to the service through a device lacking either the ability to receive data at more than two times the video consumption rate or the storage space required to store more than 20 to 25 percent of the video they are watching. While these customers will not have access to any of the interactive features provided by our protocol, they will be able to watch videos after the same wait time as all other customers.

Three-dimensional RAID Arrays with Fast Repairs

2021 International Conference on Electrical, Computer and Energy Technologies (ICECET), 2021

Large data storage systems often use Reed-Solomon erasure codes to protect their static data agai... more Large data storage systems often use Reed-Solomon erasure codes to protect their static data against triple or even quadruple device failures. A main drawback of this approach is the high cost of recovering the contents of failed devices, as it requires accessing the contents of a large number of surviving devices. We present a three-dimensional RAID organization that adds vertical parity devices to a stack of identical two-dimensional RAID arrays. These new vertical parity devices will let the organization recover faster from all single device failures while greatly reducing the risk of data loss. Depending on the way the vertical parities are defined, the new arrays will either toler-ate all triple failures and more than 99.9 percent of all quadruple failures, or all quintuple failures and more than 99.995 percent of all sextuple failures.

Voting without version numbers

1997 IEEE International Performance, Computing and Communications Conference

Voting protocols are widely used to provide mutual exclusion in distributed systems and to guaran... more Voting protocols are widely used to provide mutual exclusion in distributed systems and to guarantee the consistency of replicated data in the presence of network partitions. Unfortunately, the most e cient voting protocols require fairly complex metadata to assert which replicas are up-to-date and to denote the replicas that belong to that set. We present a m uch simpler technique that does not require version numbers and maintains only n+logn bits of state per replica. We show, under standard Markovian assumptions, that a static voting protocol using our method provides nearly the same data availability as a static voting protocol using version numbers. We also describe a dynamic voting protocol using our method that provides the same data availability as a dynamic voting protocol using much more complex metadata.

Using entanglements to increase the reliability of two-dimensional square RAID arrays

2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), 2017

Two-dimensional square RAID arrays organize their data disks in such a way that each of them belo... more Two-dimensional square RAID arrays organize their data disks in such a way that each of them belongs to exactly one row parity stripe and one column parity stripe. Even so, they remain vulnerable to the loss of any given data disk and the parity disks of its two stripes. We show how to eliminate all but one of these fatal triple failures by entangling the parity disks of the array, that is, XORing the contents of each parity disk with that of its predecessors. As a result, our new organization reduces the number of fatal triple failures by 96 to 99 percent and the number of fatal quadruple failures by around 85 percent without the need for any additional hardware.

The case for aggressivepartial preloadingin broadcasting protocols for video-on-demand

IEEE International Conference on Multimedia and Expo, 2001. ICME 2001., 2001

Broadcasting protocols for video-on-demand usually consume over fifty percent of their bandwidth ... more Broadcasting protocols for video-on-demand usually consume over fifty percent of their bandwidth to distribute the first ten to fifteen minutes of the videos they distribute. Since all these protocols require the user set-top box to include a disk drive, we propose to use this drive to store the first five to twenty minutes of the ten to twenty most popular videos. This will provide low-cost instant access to these videos.

Disk failure prediction in heterogeneous environments

2017 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), 2017

Recent studies have shown the benefits of using SMART attributes to predict disk failures in homo... more Recent studies have shown the benefits of using SMART attributes to predict disk failures in homogeneous populations of disks from the same make and model. We address here the case of data centers with more heterogeneous disk populations, such as the ones described in the BackBlaze datasets, and propose to build global disk failure predictors that would apply to disks of all makes and models. Our first challenge was the large number of SMART parameters that were missing for most makes and models in many disk instances of our dataset. As a result, we had to discard the SMART attributes that were missing in at least 90 percent of the disks, which left us with 21 SMART attributes. We then applied a Reverse Arrangement Test to these attributes to select the strongest disk failure indicators. We investigated three different machine learning models (Decision Trees, Neural Networks, and Logistic Regression) using the 2015 BackBlaze data to train and validate our predictors. Our best model was a decision tree that identified true failure events among the disks that tested positive for at least one of our failure indicators. We then used the 2016 BackBlaze data to evaluate its performance. Our results show that our decision tree identifies at least 52 percent of all disk failures and makes nearly all its predictions several days ahead: no more than 2.45 percent of the predicted failures occur within one day or two of the prediction. Finally, we compared the performance of our predictor with those of the RAIDShield and the original BackBlaze predictor. We found out that RAIDShield could predict at most 18 percent of disk failures, that is, 34 percent fewer failures than our decision tree while the BackBlaze predictor predicted 60 percent of disk failures but generated 4 to 5 false alarms per correct prediction.

RESAR: Reliable Storage at Exabyte Scale

2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2016

Stored data needs to be protected against device failure and irrecoverable sector read errors, ye... more Stored data needs to be protected against device failure and irrecoverable sector read errors, yet doing so at exabyte scale can be challenging given the large number of failures that must be handled. We have developed RESAR (Robust, Efficient, Scalable, Autonomous, Reliable) storage, an approach to storage system redundancy that only uses XORbased parity and employs a graph to lay out data and parity. The RESAR layout offers greater robustness and higher flexibility for repair at the same overhead as a declustered version of RAID 6. For instance, a RESAR-based layout with 16 data disklets per stripe has about 50 times lower probability of suffering data loss in the presence of a fixed number of failures than a corresponding RAID 6 organization. RESAR uses a layer of virtual storage elements to achieve better manageability, a broader potential for energy savings, as well as easier adoption of heterogeneous storage devices.

Simple data entanglement layouts with high reliability

2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC), 2016

We study the reliability of open and close entanglements, two simple data distribution layouts fo... more We study the reliability of open and close entanglements, two simple data distribution layouts for log-structured append-only storage systems. Both techniques use equal numbers of data and parity drives and generate their parity data by computing the exclusive or (XOR) of the most recently appended data with the contents of their last parity drive. While open entanglements maintain an open chain of data and parity drives, closed entanglements include the exclusive or of the contents of their first and last data drives. We evaluate five-year reliabilities of open and closed entanglements, for two different array sizes and drive failure rates. Our results show that open entanglements provide much better five-year reliabilities than mirroring and reduce the probability of a data loss by at least 90 percent over a period of five years. Closed entanglements perform even better and reduce the same probability by at least 98 percent.

Pirogue, a lighter dynamic version of the Raft distributed consensus algorithm

2015 IEEE 34th International Performance Computing and Communications Conference (IPCCC), 2015

Raft is a new distributed consensus algorithm that is easier to understand than the older Paxos a... more Raft is a new distributed consensus algorithm that is easier to understand than the older Paxos algorithm. Raft's major drawback is its high energy footprint: as it relies on static quorums for deciding when it can commit updates, it requires five participants to protect against two simultaneous failures. We propose to reduce this footprint by replacing the static quorums that Raft currently uses by quorums that vary according to the number of currently available participants. We present first a modified dynamic-linear voting protocol that disables singleserver updates and show that a Raft cluster with four participants managed by this protocol would be almost as available as a conventional Raft cluster with five participants and always tolerate the irrecoverable failure of any single participant without any data loss. In addition, we show a Raft cluster with three participants and a witness managed by an unmodified dynamic-linear voting protocol would be more available than a conventional Raft cluster with five participants and could still tolerate most irrecoverable failures of any single participant while maintaining recoverability.

Reducing the Energy Footprint of a Distributed Consensus Algorithm

2015 11th European Dependable Computing Conference (EDCC), 2015

The Raft consensus algorithm is a new distributed consensus algorithm that is both easier to unde... more The Raft consensus algorithm is a new distributed consensus algorithm that is both easier to understand and more straightforward to implement than the older Paxos algorithm. Its major limitation is its high energy footprint. As it relies on majority consensus voting for deciding when to commit an update, Raft requires five participants to protect against two simultaneous failures. We propose two methods for reducing this huge energy footprint. Our first proposal consists of adjusting Raft quorums in a way that would allow updates to proceed with as few as two servers while requiring a larger quorum for electing a new leader. Our second proposal consists of replacing one or two of the five Raft servers with witnesses, that is, lightweight servers that maintain the same metadata as other servers but hold no data and can therefore run on very low-power hosts. We show that these substitutions have little impact on the cluster availability but very different impacts on the risks of incurring a data loss.

The Case For Aggressive Partial Preloading In

ABSTRACT

An efficient implementation of interactive video-on-demand

Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728)

The key performance bottleneck for a video-on-demand (VOD) server is bandwidth, which controls th... more The key performance bottleneck for a video-on-demand (VOD) server is bandwidth, which controls the number of clients the server can simultaneously support. Previous work has shown that a strategy called stream tapping can make efficient use of bandwidth when clients are not allowed to interact (through VCR-like controls) with the video they are viewing. Here we present an interactive version of stream tapping and analyze its performance through the use of discrete event simulation. In particular, we show that stream tapping can use as little as 10% of the bandwidth required by dedicating a unique stream of data to each client request.

A low bandwidth broadcasting protocol for video on demand

Proceedings 7th International Conference on Computer Communications and Networks (Cat. No.98EX226)

Broadcasting protocols can improve the efficiency of video on demand services by reducing the ban... more Broadcasting protocols can improve the efficiency of video on demand services by reducing the bandwidth required to transmit videos that are simultaneously watched by many viewers. We present here a polyharmonic broadcasting protocol that requires less bandwidth than the best extant protocols to achieve the same low maximum waiting time. We also show how to modify the protocol to accommodate very long videos without increasing the buffering capacity of the set-top box.

Using storage class memories to increase the reliability of two-dimensional RAID arrays

by Ahmed Amer and Jehan-Francois Paris

Two-dimensional RAID arrays maintain separate row and column parities for all their disks. Depend... more Two-dimensional RAID arrays maintain separate row and column parities for all their disks. Depending on their organization, they can tolerate between two and three concurrent disk failures without losing any data. We propose to enhance the robustness of these arrays by replacing a small fraction of these drives with storage class memory devices, and demonstrate how such a pairing is

Protecting Data against Consecutive Disk Failures in RAID-5

2009 Mexican International Conference on Computer Science, 2009

"An Analytical Study of Strategy-Griented Restructuring Algorithms

Considerable expertr::lental eVidence has been accumulated showing that the performance ot progra... more Considerable expertr::lental eVidence has been accumulated showing that the performance ot programs in virtual memory environments can be signifi· cantly Improved by restT".J.cturing the programs. I.e. by modifying their block-la-page or block-ta-segment mapping. This evidence also points aut that the so-caned strategy-o'r'iented algorithms. which base their decisions on the lmoW"ledge of the memory management strategy under which the pro· gram will run, are more efficient than those algorithms whicr ~ do not take this strategy into account. We present here some thecretical argwnents to explain Why strateg~' oriented algorithms perform better than other program restructuring algo-rithms and deterI!1ine the conditions under which these algorithms are op· timUI:::1. In particular. we prove that the algorith.."11s oriented towacds the working set or sampled working set policy are optimum when applied to pro-grams haVing no more than two blocks per page. an...

Scalable approaches for tree-based reliable multicast

Tree-based reliable multicast protocols provide scalability by distributing error-recovery tasks ... more Tree-based reliable multicast protocols provide scalability by distributing error-recovery tasks among several repair nodes. These repair nodes keep in their buffers all recently received packets and perform error recovery for their receiver nodes. This work addresses two open issues in tree-based protocols. The first is how to construct a logical tree in an efficient manner. We propose three efficient hybrid schemes for constructing a well-organized logical tree with reasonable message and time overhead. The second open issue is when to discard packets from the buffers of repair nodes. Discarding packets that might still be needed is unacceptable, because it would force the receiver nodes to contact the sender node whenever one of them needs a retransmission of a discarded packet. Schemes addressing this issue can be broadly divided into ACK-based and NAK-based schemes. ACK-based schemes require each repair node to receive one acknowledgement from each of its receiver node for each...

Andwidth a Llocation I Ssues in N Ear V Ideo on D Emand S Ervices

One way to reduce the cost of VOD is to schedule repeated broadcasts of the videos that are likel... more One way to reduce the cost of VOD is to schedule repeated broadcasts of the videos that are likely to be watched by many viewers rather than waiting for individual requests. This technique is known as video broadcasting. The savings that can be achieved are considerable, as it is often the case that 40 percent of the demand is for a small number, say, 10 to 20, of hot videos [3]. Depending on the frequency at which these videos are rebroadcast, customers may have to wait between a few minutes to, say, half an hour before watching the video of their choice. Hence this type of service can either be referred to as near video on demand (NVOD) or enhanced pay per view (EPPV).

Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments

2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018

Data centres that use consumer-grade disks drives and distributed peer-to-peer systems are unreli... more Data centres that use consumer-grade disks drives and distributed peer-to-peer systems are unreliable environments to archive data without enough redundancy. Most redundancy schemes are not completely effective for providing high availability, durability and integrity in the long-term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large scale storage system. Our motivation is to design flexible and practical erasure codes with high faulttolerance to improve data durability and availability even in catastrophic scenarios. By "flexible and practical", we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the possible paths to recover data exponentially. Two other parameters increase fault-tolerance even further without the need of additional storage. As a result, an entangled storage system can provide high availability, durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical codes. Remarkably, they excel at code locality, hence, they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios.

Merkle Hash Grids Instead of Merkle Trees

2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)

Merkle grids are a new data organization that replicates the functionality of Merkle trees while ... more Merkle grids are a new data organization that replicates the functionality of Merkle trees while reducing their transmission and storage costs by up to 50 percent. All Merkle grids organize the objects whose conformity they monitor in a square array. They add row and column hashes to it such that (a) all row hashes contain the hash of the concatenation of the hashes of all the objects in their respective row and (b) all column hashes contain the hash of the concatenation of the hashes of all the objects in their respective column. In addition, a single signed master hash contains the hash of the concatenation of all row and column hashes. Extended Merkle grids add two auxiliary Merkle trees to speed up searches among both row hashes and column hashes. While both basic and extended Merkle grids perform authentication of all blocks better than Merkle trees, only extended Merkle grids can locate individual non-conforming objects or authenticate a single non-conforming object as fast as Merkle trees.

A proactive implementation of interactive video-on-demand

Conference Proceedings of the 2003 IEEE International

Most broadcasting protocols for video-on-demand do not allow the customer to pause, move fast-for... more Most broadcasting protocols for video-on-demand do not allow the customer to pause, move fast-forward or backward while watching a video. We propose a broadcasting protocol implementing these features in a purely proactive fashion. 12 Our protocol implements rewind and pause interactions at the set-top box level by requiring the set-top box to keep in its buffer all video data it has received from the server until the customer has finished watching the video. It implements fast-forward by letting the video server transmit video data more frequently than needed by customers watching the video in sequence. As a result, any customer having watched the first x minutes of a video will be able to fast-forward to any scene within the first 2x or 3x minutes of the video. We show that this expanding horizon feature can be provided at a reasonable cost. We also show how our protocol can accommodate customers connected to the service through a device lacking either the ability to receive data at more than two times the video consumption rate or the storage space required to store more than 20 to 25 percent of the video they are watching. While these customers will not have access to any of the interactive features provided by our protocol, they will be able to watch videos after the same wait time as all other customers.

Three-dimensional RAID Arrays with Fast Repairs

2021 International Conference on Electrical, Computer and Energy Technologies (ICECET), 2021

Large data storage systems often use Reed-Solomon erasure codes to protect their static data agai... more Large data storage systems often use Reed-Solomon erasure codes to protect their static data against triple or even quadruple device failures. A main drawback of this approach is the high cost of recovering the contents of failed devices, as it requires accessing the contents of a large number of surviving devices. We present a three-dimensional RAID organization that adds vertical parity devices to a stack of identical two-dimensional RAID arrays. These new vertical parity devices will let the organization recover faster from all single device failures while greatly reducing the risk of data loss. Depending on the way the vertical parities are defined, the new arrays will either toler-ate all triple failures and more than 99.9 percent of all quadruple failures, or all quintuple failures and more than 99.995 percent of all sextuple failures.

Voting without version numbers

1997 IEEE International Performance, Computing and Communications Conference

Voting protocols are widely used to provide mutual exclusion in distributed systems and to guaran... more Voting protocols are widely used to provide mutual exclusion in distributed systems and to guarantee the consistency of replicated data in the presence of network partitions. Unfortunately, the most e cient voting protocols require fairly complex metadata to assert which replicas are up-to-date and to denote the replicas that belong to that set. We present a m uch simpler technique that does not require version numbers and maintains only n+logn bits of state per replica. We show, under standard Markovian assumptions, that a static voting protocol using our method provides nearly the same data availability as a static voting protocol using version numbers. We also describe a dynamic voting protocol using our method that provides the same data availability as a dynamic voting protocol using much more complex metadata.

Using entanglements to increase the reliability of two-dimensional square RAID arrays

2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), 2017

Two-dimensional square RAID arrays organize their data disks in such a way that each of them belo... more Two-dimensional square RAID arrays organize their data disks in such a way that each of them belongs to exactly one row parity stripe and one column parity stripe. Even so, they remain vulnerable to the loss of any given data disk and the parity disks of its two stripes. We show how to eliminate all but one of these fatal triple failures by entangling the parity disks of the array, that is, XORing the contents of each parity disk with that of its predecessors. As a result, our new organization reduces the number of fatal triple failures by 96 to 99 percent and the number of fatal quadruple failures by around 85 percent without the need for any additional hardware.

The case for aggressivepartial preloadingin broadcasting protocols for video-on-demand

IEEE International Conference on Multimedia and Expo, 2001. ICME 2001., 2001

Broadcasting protocols for video-on-demand usually consume over fifty percent of their bandwidth ... more Broadcasting protocols for video-on-demand usually consume over fifty percent of their bandwidth to distribute the first ten to fifteen minutes of the videos they distribute. Since all these protocols require the user set-top box to include a disk drive, we propose to use this drive to store the first five to twenty minutes of the ten to twenty most popular videos. This will provide low-cost instant access to these videos.

Disk failure prediction in heterogeneous environments

2017 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), 2017

Recent studies have shown the benefits of using SMART attributes to predict disk failures in homo... more Recent studies have shown the benefits of using SMART attributes to predict disk failures in homogeneous populations of disks from the same make and model. We address here the case of data centers with more heterogeneous disk populations, such as the ones described in the BackBlaze datasets, and propose to build global disk failure predictors that would apply to disks of all makes and models. Our first challenge was the large number of SMART parameters that were missing for most makes and models in many disk instances of our dataset. As a result, we had to discard the SMART attributes that were missing in at least 90 percent of the disks, which left us with 21 SMART attributes. We then applied a Reverse Arrangement Test to these attributes to select the strongest disk failure indicators. We investigated three different machine learning models (Decision Trees, Neural Networks, and Logistic Regression) using the 2015 BackBlaze data to train and validate our predictors. Our best model was a decision tree that identified true failure events among the disks that tested positive for at least one of our failure indicators. We then used the 2016 BackBlaze data to evaluate its performance. Our results show that our decision tree identifies at least 52 percent of all disk failures and makes nearly all its predictions several days ahead: no more than 2.45 percent of the predicted failures occur within one day or two of the prediction. Finally, we compared the performance of our predictor with those of the RAIDShield and the original BackBlaze predictor. We found out that RAIDShield could predict at most 18 percent of disk failures, that is, 34 percent fewer failures than our decision tree while the BackBlaze predictor predicted 60 percent of disk failures but generated 4 to 5 false alarms per correct prediction.

RESAR: Reliable Storage at Exabyte Scale

2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2016

Stored data needs to be protected against device failure and irrecoverable sector read errors, ye... more Stored data needs to be protected against device failure and irrecoverable sector read errors, yet doing so at exabyte scale can be challenging given the large number of failures that must be handled. We have developed RESAR (Robust, Efficient, Scalable, Autonomous, Reliable) storage, an approach to storage system redundancy that only uses XORbased parity and employs a graph to lay out data and parity. The RESAR layout offers greater robustness and higher flexibility for repair at the same overhead as a declustered version of RAID 6. For instance, a RESAR-based layout with 16 data disklets per stripe has about 50 times lower probability of suffering data loss in the presence of a fixed number of failures than a corresponding RAID 6 organization. RESAR uses a layer of virtual storage elements to achieve better manageability, a broader potential for energy savings, as well as easier adoption of heterogeneous storage devices.

Simple data entanglement layouts with high reliability

2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC), 2016

We study the reliability of open and close entanglements, two simple data distribution layouts fo... more We study the reliability of open and close entanglements, two simple data distribution layouts for log-structured append-only storage systems. Both techniques use equal numbers of data and parity drives and generate their parity data by computing the exclusive or (XOR) of the most recently appended data with the contents of their last parity drive. While open entanglements maintain an open chain of data and parity drives, closed entanglements include the exclusive or of the contents of their first and last data drives. We evaluate five-year reliabilities of open and closed entanglements, for two different array sizes and drive failure rates. Our results show that open entanglements provide much better five-year reliabilities than mirroring and reduce the probability of a data loss by at least 90 percent over a period of five years. Closed entanglements perform even better and reduce the same probability by at least 98 percent.

Pirogue, a lighter dynamic version of the Raft distributed consensus algorithm

2015 IEEE 34th International Performance Computing and Communications Conference (IPCCC), 2015

Raft is a new distributed consensus algorithm that is easier to understand than the older Paxos a... more Raft is a new distributed consensus algorithm that is easier to understand than the older Paxos algorithm. Raft's major drawback is its high energy footprint: as it relies on static quorums for deciding when it can commit updates, it requires five participants to protect against two simultaneous failures. We propose to reduce this footprint by replacing the static quorums that Raft currently uses by quorums that vary according to the number of currently available participants. We present first a modified dynamic-linear voting protocol that disables singleserver updates and show that a Raft cluster with four participants managed by this protocol would be almost as available as a conventional Raft cluster with five participants and always tolerate the irrecoverable failure of any single participant without any data loss. In addition, we show a Raft cluster with three participants and a witness managed by an unmodified dynamic-linear voting protocol would be more available than a conventional Raft cluster with five participants and could still tolerate most irrecoverable failures of any single participant while maintaining recoverability.

Reducing the Energy Footprint of a Distributed Consensus Algorithm

2015 11th European Dependable Computing Conference (EDCC), 2015

The Raft consensus algorithm is a new distributed consensus algorithm that is both easier to unde... more The Raft consensus algorithm is a new distributed consensus algorithm that is both easier to understand and more straightforward to implement than the older Paxos algorithm. Its major limitation is its high energy footprint. As it relies on majority consensus voting for deciding when to commit an update, Raft requires five participants to protect against two simultaneous failures. We propose two methods for reducing this huge energy footprint. Our first proposal consists of adjusting Raft quorums in a way that would allow updates to proceed with as few as two servers while requiring a larger quorum for electing a new leader. Our second proposal consists of replacing one or two of the five Raft servers with witnesses, that is, lightweight servers that maintain the same metadata as other servers but hold no data and can therefore run on very low-power hosts. We show that these substitutions have little impact on the cluster availability but very different impacts on the risks of incurring a data loss.

The Case For Aggressive Partial Preloading In

ABSTRACT

An efficient implementation of interactive video-on-demand

Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728)

The key performance bottleneck for a video-on-demand (VOD) server is bandwidth, which controls th... more The key performance bottleneck for a video-on-demand (VOD) server is bandwidth, which controls the number of clients the server can simultaneously support. Previous work has shown that a strategy called stream tapping can make efficient use of bandwidth when clients are not allowed to interact (through VCR-like controls) with the video they are viewing. Here we present an interactive version of stream tapping and analyze its performance through the use of discrete event simulation. In particular, we show that stream tapping can use as little as 10% of the bandwidth required by dedicating a unique stream of data to each client request.

A low bandwidth broadcasting protocol for video on demand

Proceedings 7th International Conference on Computer Communications and Networks (Cat. No.98EX226)

Broadcasting protocols can improve the efficiency of video on demand services by reducing the ban... more Broadcasting protocols can improve the efficiency of video on demand services by reducing the bandwidth required to transmit videos that are simultaneously watched by many viewers. We present here a polyharmonic broadcasting protocol that requires less bandwidth than the best extant protocols to achieve the same low maximum waiting time. We also show how to modify the protocol to accommodate very long videos without increasing the buffering capacity of the set-top box.