Skip to main content

Z. Ozsoyoglu

Followers

13

Following

5

Co-authors

5

Public Views

Vasilis Vassalos

Athens University of Economics and Business

Manolis Gergatsoulis

Ionian University

Matthew Damigos

Elke Rundensteiner

University of California, San Diego

Reza Kurniawan chandra

Universiteit Hasselt

Interests

Uploads

Papers by Z. Ozsoyoglu

Rewriting XPath queries using materialized views

Very Large Data Bases, Aug 30, 2005

As a simple XML query language but with enough expressive power, XPath has become very popular. T... more As a simple XML query language but with enough expressive power, XPath has become very popular. To expedite evaluation of XPath queries, we consider the problem of rewriting XPath queries using materialized XPath views. This problem is very important and arises not only from query optimization in server side but also from semantic caching in client side. We consider the problem of deciding whether there exists a rewriting of a query using XPath views and the problem of finding minimal rewritings. We first consider those two problems for a very practical XPath fragment containing the descendent, child, wildcard and branch features. We show that the rewriting existence problem is coNP-hard and the problem of finding minimal rewritings is Σ p 3 . We also consider those two rewriting problems for three subclasses of this XPath fragment, each of which contains child feature and two of descendent, wildcard and branch features, and show that both rewriting problems can be polynomially solved. Finally, we give an algorithm for finding minimal rewritings, which is sound for the XPath fragment, but is also complete and runs in polynomial time for its three subclasses.

Processing OODB queries by O-Algebra

O-Algebra is an object algebra designed for processing Object-Oriented Database (OODB) queries. W... more O-Algebra is an object algebra designed for processing Object-Oriented Database (OODB) queries. We present the concept of internal type objects, which are uniform and represent the front-end objects. O-Algebra is an algebra whose operands are collections of internal objects. Due to uniform operands and simple operators defined in O-Algebra, a small yet powerful set of O-Algebra laws can be obtained, which is important for query optimization by algebraic rewriting. After presenting O-Algebra, we introduce an approach to transform OQL queries to O-Algebra queries. Since O-Algebra operations do not have complex arguments, the nested queries of OQL can be reduced by a general method after they are transformed t o O-Algebra queries. Compared to other approaches of reducing nested queries, this approach k more general because it is not restricted by the patterns of nested queries.

A graphical query language: VISUAL and its query processing

IEEE Transactions on Knowledge and Data Engineering, Sep 1, 2002

Abstract This paper describes VISUAL, a graphical icon-based query language with a user-friendly ... more Abstract This paper describes VISUAL, a graphical icon-based query language with a user-friendly graphical user interface for scientific databases, and its query processing techniques. VISUAL is suitable for domains where visualization of the relationships is important for the domain scientist to express queries. In VISUAL, graphical objects are not tied to the underlying formalism; instead, they represent the relationships of

On efficient reasoning with implication constraints

Lecture Notes in Computer Science, 1993

In this paper, we address the complexity issue of reasoning with implication constraints. We cons... more In this paper, we address the complexity issue of reasoning with implication constraints. We consider the IC-RFT problem, which is the problem of deciding whether a conjunctive yes/no query always produces the empty relation (\no" answer) on database instances satisfying a given set of implication constraints, as a central problem in this respect. We show that several other important problems, such as the query containment problem, are polynomially equivalent to the IC-RFT problem. More importantly, we give criteria for designing a set of implication constraints so that an e cient \units-refutation" process can be used to solve the IC-RFT problem.

A normal form for nested relations

Page 1. A NORMAL FORM FOR NESTED RELATIONS* (Extended Abstract) Z. Meral Ozsoyoglu and Li-Yan Yua... more

A complete translation from SPARQL into efficient SQL

Abstract This paper presents a feature-complete translation from SPARQL, the proposed standard fo... more Abstract This paper presents a feature-complete translation from SPARQL, the proposed standard for RDF querying, into efficient SQL. We propose&amp;amp;amp;amp;amp;amp;amp;quot; SQL model&amp;amp;amp;amp;amp;amp;amp;quot;-based algorithms that implement each SPARQL algebra operator via SQL query augmentation, and ...

A tree-structured query interface for querying semi-structured data

Statistical and Scientific Database Management, Jun 21, 2004

XML is a semistructured data format that is quickly becoming the standard means of communication ... more XML is a semistructured data format that is quickly becoming the standard means of communication across the Internet. We introduce a simple, powerful, unambiguous tree-structured user-interface for querying semistructured data. It emphasizes ease-of-use and provides generic access to XML repositories. An implementation of this interface called the Pathways Explorer has been built in the context of biological pathways. The implementation illustrates some basic capabilities, such as selection, projection, equijoins, and self-joins. It builds efficient queries using SQL, and it has been shown that XQuery's performance would be comparable. Future implementations might allow semantic client-side caching, aggregation and grouping, recursion, and set operations.

A Framework for Querying Pedigree Data

ABSTRACT Genealogy information is becoming increasingly abundant in light of modern genetics and ... more ABSTRACT Genealogy information is becoming increasingly abundant in light of modern genetics and the study of diseases and risk factors. As the volume of this structured pedigree data expands, there is a pressing need for better ways to manage, store, and efficiently query this data. Building on recent advances in semi-structured data management and proven relational database technology, we propose a general-purpose pedigree query language (PQL) and evaluation framework for elegantly expressing and efficiently evaluating queries on this data. In this paper, we describe how the problem of modeling and querying pedigree data differs from XML, present an overview of PQL, and present efficient evaluation for key parts of the language. Experimental results using real data show significant (&gt;850%) performance improvement for complex queries over naive evaluation

Metadata-based modeling of information resources on the Web

Journal of the Association for Information Science and Technology, Oct 29, 2003

This paper deals with the problem of modeling Web information resources using expert knowledge an... more This paper deals with the problem of modeling Web information resources using expert knowledge and personalized user information for improved Web searching capabilities. We propose a "Web information space" model, which is composed of Web-based information resources (HTML/XML [Hypertext Markup Language/Extensible Markup Language] documents on the Web), expert advice repositories (domain-expert-specified metadata for information resources), and personalized information about users (captured as user profiles that indicate users' preferences about experts as well as users' knowledge about topics). Expert advice, the heart of the Web information space model, is specified using topics and relationships among topics (called metalinks), along the lines of the recently proposed topic maps. Topics and metalinks constitute metadata that describe the contents of the underlying HTML/XML Web resources. The metadata specification process is semiautomated, and it exploits XML DTDs (Document Type Definition) to allow domain-expert guided mapping of DTD elements to topics and metalinks. The expert advice is stored in an object-relational database management system (DBMS). To demonstrate the practicality and usability of the proposed Web information space model, we created a prototype expert advice repository of more than one million topics/metalinks for DBLP (Database and Logic Programming) Bibliography data set. We also present a query interface that provides sophisticated querying fa-cilities for DBLP Bibliography resources using the expert advice repository.

Improvements in distance-based indexing

Statistical and Scientific Database Management, Jun 21, 2004

This work offers some improvements in the current distance-based indexing techniques. An optimal ... more This work offers some improvements in the current distance-based indexing techniques. An optimal similarity search algorithm that is adopted from vector-based indexing is shown to be also optimal for distance-based indices. Farther similarity between the two types of indexing is revealed, leading to a general description of search structures. A probabilistic analysis of distance-based tree indices is also shown to

On automated lesson construction from electronic textbooks

IEEE Transactions on Knowledge and Data Engineering, Mar 1, 2004

An electronic book may be viewed as an application with a multimedia database. We define an elect... more An electronic book may be viewed as an application with a multimedia database. We define an electronic textbook as an electronic book that is used in conjunction with instructional resources such as lectures. In this paper, we propose an electronic textbook data model with topics, topic sources, metalinks (relationships among topics), and instructional modules which are multimedia presentations possibly capturing real-life lectures of instructors. Using the data model, the system provides users a topic-guided multimedia lesson construction. This paper concentrates, in detail, on the use of one metalink type in lesson construction, namely, prerequisite dependencies, and provides a sound and complete axiomatization of prerequisite dependencies. We present a simple automated way of constructing lessons for users where the user lists a set of topic names (s)he is interested in, and the system automatically constructs and delivers the "best" user-tailored lesson as a multimedia presentation, where "best" is characterized in terms of both topic closures with respect to prerequisite dependencies and what the user knows about topics. We model and present sample lesson construction requests for users, discuss their complexity, and give algorithms that evaluate such requests. For expensive lesson construction requests, we list heuristics and empirically evaluate their performance. We also discuss the worst-case performance guarantees of lesson request algorithms.

A comparison of methods for semantic photo annotation suggestion

ABSTRACT In this work we describe a general framework for semi-automated semantic digital photo a... more ABSTRACT In this work we describe a general framework for semi-automated semantic digital photo annotation though the use of suggestions. We compare context-based methods with Latent Semantic Indexing, a linear algebra approach to information retrieval. Through experiments on real data sets containing up to 13,705 semantically annotated photos, we show that a carefully chosen combination of context-based methods can not only be efficient, but also extremely effective as well. Furthermore, we propose a new combination of context-based methods that outperforms previous work by up to 19% higher recall while running up to 21 times faster.

Efficiently calculating inbreeding on large pedigrees databases

Information Systems, Sep 1, 2009

We consider pedigree data structured in the form of a directed acyclic graph, and use an encoding... more We consider pedigree data structured in the form of a directed acyclic graph, and use an encoding scheme, called NodeCodes, for expediting the evaluation of queries on pedigree graph structures. Inbreeding is the quantitative measure of the genetic relationship between two individuals. The inbreeding coefficient is related to the probability that both copies of any given gene are received from the same ancestor. In this paper we discuss the evaluation of the inbreeding coefficient of a given individual using NodeCodes and propose a new encoding scheme, Family NodeCodes, which is further optimized for pedigree graphs. We implemented and tested these approaches on both synthetic and real pedigree data in terms of performance and scalability. Experimental results show that the use of NodeCodes provides a good alternative for queries involving the inbreeding coefficient, with significant improvements over the traditional iterative evaluation methods (up to 10.1 times faster), and Family NodeCodes further improves this to 77.1 times faster while using 91% less space than regular NodeCodes.

Classification of Large Microarray Datasets Using Fast Random Forest Construction

Journal of Bioinformatics and Computational Biology, Apr 1, 2011

Random forest is an ensemble classification algorithm. It performs well when most predictive vari... more Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high dimensional microarray data sets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.

Synchronized counting method

A direct extension to the counting method is presented which can deal efficiently with both acycl... more A direct extension to the counting method is presented which can deal efficiently with both acyclic and cyclic relations. The extension to cycle cases, called the synchronized counting method, is simulated and studied using a Petri net model. Worst-case analysis shows that n2 semijoin operations are required, where n is the number of nodes in the graph representing the relevant

Genomic pathways database and biological data management

Animal Genetics, Aug 1, 2006

In this paper we discuss the properties of biological data, and challenges it poses for data mana... more In this paper we discuss the properties of biological data, and challenges it poses for data management, and argue that, in order to meet the data management requirements for "digital biology", careful integration of the existing technologies and the development of new data management techniques for biological data are needed. Based on this premise, we present PathCase: Case Pathways Database System. PathCase is an integrated set of software tools for modeling, storing, analyzing, visualizing, and querying biological pathways data at different levels of genetic, molecular, biochemical and organismal detail. The novel features of the system include: (a) genomic information integrated with other biological data and presented starting from pathways, (b) design for biologists who are possibly unfamiliar with genomics, but whose research is essential for annotating gene and genome sequences with biological functions, (c) database design, implementation and graphical tools which enable users to visualize pathways data in multiple abstraction levels, and to pose exploratory queries, (d) a wide range of different types of queries including, "path" and "neighborhood queries", and graphical visualization of query outputs, and, (e) an implementation that allows for web(XML)-based dissemination of query outputs (i.e., pathways data in BIOPAX format) to researchers in the community, giving them control on the use of pathways data.

As more RDF data management systems and RDF data querying techniques emerge, RDF benchmarks provi... more As more RDF data management systems and RDF data querying techniques emerge, RDF benchmarks providing a controllable and comparable testing environment for applications are needed. To address the needs of diverse applications, we propose an application-specific framework, called RBench, to generate RDF benchmarks. RBench takes an RDF dataset from any application as a template, and generates a set of synthetic datasets with similar characteristics including graph structure and literal labels, for the required "size scaling factor" and the "degree scaling factor". RBench analyzes several features from the given RDF dataset, and uses them to reconstruct the new benchmark graph. A flexible query load generation process is then proposed according to the design of RBench. Efficiency and usability of RBench are demonstrated via experimental results.

A graph query language and its query processing

Many new database applications involve querying of graph data. In this paper, we present an objec... more

Topic-Centric Querying of Web Information Resources

Springer eBooks, 2001

This paper deals with the problem of modeling web information resources using expert knowledge an... more This paper deals with the problem of modeling web information resources using expert knowledge and personalized user information, and querying them in terms of topics and topic relationships. We propose a model for web information resources, and a query language SQL-TC (Topic-Centric SQL) to query the model. The model is composed of web-based information resources (XML or HTML documents on the web), expert advice repositories (domain-expert-specified metadata for information resources), and personalized information about users (captured as user profiles, that indicate users' preferences as to which expert advice they would like to follow, and which to ignore, etc). The query language SQL-TC makes use of the metadata information provided in expert advice repositories and embedded in information resources, and employs user preferences to further refine the query output. Query output objects/tuples are ranked with respect to the (expert-judged and userpreference-revised) importance values of requested topics/metalinks, and the query output is limited by either top n-ranked objects/tuples, or objects/tuples with importance values above a given threshold, or both.

Research in Time-and Error-Constrained Database Query Processing

Rewriting XPath queries using materialized views

Very Large Data Bases, Aug 30, 2005

As a simple XML query language but with enough expressive power, XPath has become very popular. T... more As a simple XML query language but with enough expressive power, XPath has become very popular. To expedite evaluation of XPath queries, we consider the problem of rewriting XPath queries using materialized XPath views. This problem is very important and arises not only from query optimization in server side but also from semantic caching in client side. We consider the problem of deciding whether there exists a rewriting of a query using XPath views and the problem of finding minimal rewritings. We first consider those two problems for a very practical XPath fragment containing the descendent, child, wildcard and branch features. We show that the rewriting existence problem is coNP-hard and the problem of finding minimal rewritings is Σ p 3 . We also consider those two rewriting problems for three subclasses of this XPath fragment, each of which contains child feature and two of descendent, wildcard and branch features, and show that both rewriting problems can be polynomially solved. Finally, we give an algorithm for finding minimal rewritings, which is sound for the XPath fragment, but is also complete and runs in polynomial time for its three subclasses.

Processing OODB queries by O-Algebra

O-Algebra is an object algebra designed for processing Object-Oriented Database (OODB) queries. W... more O-Algebra is an object algebra designed for processing Object-Oriented Database (OODB) queries. We present the concept of internal type objects, which are uniform and represent the front-end objects. O-Algebra is an algebra whose operands are collections of internal objects. Due to uniform operands and simple operators defined in O-Algebra, a small yet powerful set of O-Algebra laws can be obtained, which is important for query optimization by algebraic rewriting. After presenting O-Algebra, we introduce an approach to transform OQL queries to O-Algebra queries. Since O-Algebra operations do not have complex arguments, the nested queries of OQL can be reduced by a general method after they are transformed t o O-Algebra queries. Compared to other approaches of reducing nested queries, this approach k more general because it is not restricted by the patterns of nested queries.

A graphical query language: VISUAL and its query processing

IEEE Transactions on Knowledge and Data Engineering, Sep 1, 2002

Abstract This paper describes VISUAL, a graphical icon-based query language with a user-friendly ... more Abstract This paper describes VISUAL, a graphical icon-based query language with a user-friendly graphical user interface for scientific databases, and its query processing techniques. VISUAL is suitable for domains where visualization of the relationships is important for the domain scientist to express queries. In VISUAL, graphical objects are not tied to the underlying formalism; instead, they represent the relationships of

On efficient reasoning with implication constraints

Lecture Notes in Computer Science, 1993

In this paper, we address the complexity issue of reasoning with implication constraints. We cons... more In this paper, we address the complexity issue of reasoning with implication constraints. We consider the IC-RFT problem, which is the problem of deciding whether a conjunctive yes/no query always produces the empty relation (\no" answer) on database instances satisfying a given set of implication constraints, as a central problem in this respect. We show that several other important problems, such as the query containment problem, are polynomially equivalent to the IC-RFT problem. More importantly, we give criteria for designing a set of implication constraints so that an e cient \units-refutation" process can be used to solve the IC-RFT problem.

A normal form for nested relations

Page 1. A NORMAL FORM FOR NESTED RELATIONS* (Extended Abstract) Z. Meral Ozsoyoglu and Li-Yan Yua... more

A complete translation from SPARQL into efficient SQL

Abstract This paper presents a feature-complete translation from SPARQL, the proposed standard fo... more Abstract This paper presents a feature-complete translation from SPARQL, the proposed standard for RDF querying, into efficient SQL. We propose&amp;amp;amp;amp;amp;amp;amp;quot; SQL model&amp;amp;amp;amp;amp;amp;amp;quot;-based algorithms that implement each SPARQL algebra operator via SQL query augmentation, and ...

A tree-structured query interface for querying semi-structured data

Statistical and Scientific Database Management, Jun 21, 2004

XML is a semistructured data format that is quickly becoming the standard means of communication ... more XML is a semistructured data format that is quickly becoming the standard means of communication across the Internet. We introduce a simple, powerful, unambiguous tree-structured user-interface for querying semistructured data. It emphasizes ease-of-use and provides generic access to XML repositories. An implementation of this interface called the Pathways Explorer has been built in the context of biological pathways. The implementation illustrates some basic capabilities, such as selection, projection, equijoins, and self-joins. It builds efficient queries using SQL, and it has been shown that XQuery's performance would be comparable. Future implementations might allow semantic client-side caching, aggregation and grouping, recursion, and set operations.

A Framework for Querying Pedigree Data

ABSTRACT Genealogy information is becoming increasingly abundant in light of modern genetics and ... more ABSTRACT Genealogy information is becoming increasingly abundant in light of modern genetics and the study of diseases and risk factors. As the volume of this structured pedigree data expands, there is a pressing need for better ways to manage, store, and efficiently query this data. Building on recent advances in semi-structured data management and proven relational database technology, we propose a general-purpose pedigree query language (PQL) and evaluation framework for elegantly expressing and efficiently evaluating queries on this data. In this paper, we describe how the problem of modeling and querying pedigree data differs from XML, present an overview of PQL, and present efficient evaluation for key parts of the language. Experimental results using real data show significant (&gt;850%) performance improvement for complex queries over naive evaluation

Metadata-based modeling of information resources on the Web

Journal of the Association for Information Science and Technology, Oct 29, 2003

This paper deals with the problem of modeling Web information resources using expert knowledge an... more This paper deals with the problem of modeling Web information resources using expert knowledge and personalized user information for improved Web searching capabilities. We propose a "Web information space" model, which is composed of Web-based information resources (HTML/XML [Hypertext Markup Language/Extensible Markup Language] documents on the Web), expert advice repositories (domain-expert-specified metadata for information resources), and personalized information about users (captured as user profiles that indicate users' preferences about experts as well as users' knowledge about topics). Expert advice, the heart of the Web information space model, is specified using topics and relationships among topics (called metalinks), along the lines of the recently proposed topic maps. Topics and metalinks constitute metadata that describe the contents of the underlying HTML/XML Web resources. The metadata specification process is semiautomated, and it exploits XML DTDs (Document Type Definition) to allow domain-expert guided mapping of DTD elements to topics and metalinks. The expert advice is stored in an object-relational database management system (DBMS). To demonstrate the practicality and usability of the proposed Web information space model, we created a prototype expert advice repository of more than one million topics/metalinks for DBLP (Database and Logic Programming) Bibliography data set. We also present a query interface that provides sophisticated querying fa-cilities for DBLP Bibliography resources using the expert advice repository.

Improvements in distance-based indexing

Statistical and Scientific Database Management, Jun 21, 2004

This work offers some improvements in the current distance-based indexing techniques. An optimal ... more This work offers some improvements in the current distance-based indexing techniques. An optimal similarity search algorithm that is adopted from vector-based indexing is shown to be also optimal for distance-based indices. Farther similarity between the two types of indexing is revealed, leading to a general description of search structures. A probabilistic analysis of distance-based tree indices is also shown to

On automated lesson construction from electronic textbooks

IEEE Transactions on Knowledge and Data Engineering, Mar 1, 2004

An electronic book may be viewed as an application with a multimedia database. We define an elect... more An electronic book may be viewed as an application with a multimedia database. We define an electronic textbook as an electronic book that is used in conjunction with instructional resources such as lectures. In this paper, we propose an electronic textbook data model with topics, topic sources, metalinks (relationships among topics), and instructional modules which are multimedia presentations possibly capturing real-life lectures of instructors. Using the data model, the system provides users a topic-guided multimedia lesson construction. This paper concentrates, in detail, on the use of one metalink type in lesson construction, namely, prerequisite dependencies, and provides a sound and complete axiomatization of prerequisite dependencies. We present a simple automated way of constructing lessons for users where the user lists a set of topic names (s)he is interested in, and the system automatically constructs and delivers the "best" user-tailored lesson as a multimedia presentation, where "best" is characterized in terms of both topic closures with respect to prerequisite dependencies and what the user knows about topics. We model and present sample lesson construction requests for users, discuss their complexity, and give algorithms that evaluate such requests. For expensive lesson construction requests, we list heuristics and empirically evaluate their performance. We also discuss the worst-case performance guarantees of lesson request algorithms.

A comparison of methods for semantic photo annotation suggestion

ABSTRACT In this work we describe a general framework for semi-automated semantic digital photo a... more ABSTRACT In this work we describe a general framework for semi-automated semantic digital photo annotation though the use of suggestions. We compare context-based methods with Latent Semantic Indexing, a linear algebra approach to information retrieval. Through experiments on real data sets containing up to 13,705 semantically annotated photos, we show that a carefully chosen combination of context-based methods can not only be efficient, but also extremely effective as well. Furthermore, we propose a new combination of context-based methods that outperforms previous work by up to 19% higher recall while running up to 21 times faster.

Efficiently calculating inbreeding on large pedigrees databases

Information Systems, Sep 1, 2009

We consider pedigree data structured in the form of a directed acyclic graph, and use an encoding... more We consider pedigree data structured in the form of a directed acyclic graph, and use an encoding scheme, called NodeCodes, for expediting the evaluation of queries on pedigree graph structures. Inbreeding is the quantitative measure of the genetic relationship between two individuals. The inbreeding coefficient is related to the probability that both copies of any given gene are received from the same ancestor. In this paper we discuss the evaluation of the inbreeding coefficient of a given individual using NodeCodes and propose a new encoding scheme, Family NodeCodes, which is further optimized for pedigree graphs. We implemented and tested these approaches on both synthetic and real pedigree data in terms of performance and scalability. Experimental results show that the use of NodeCodes provides a good alternative for queries involving the inbreeding coefficient, with significant improvements over the traditional iterative evaluation methods (up to 10.1 times faster), and Family NodeCodes further improves this to 77.1 times faster while using 91% less space than regular NodeCodes.

Classification of Large Microarray Datasets Using Fast Random Forest Construction

Journal of Bioinformatics and Computational Biology, Apr 1, 2011

Random forest is an ensemble classification algorithm. It performs well when most predictive vari... more Random forest is an ensemble classification algorithm. It performs well when most predictive variables are noisy and can be used when the number of variables is much larger than the number of observations. The use of bootstrap samples and restricted subsets of attributes makes it more powerful than simple ensembles of trees. The main advantage of a random forest classifier is its explanatory power: it measures variable importance or impact of each factor on a predicted class label. These characteristics make the algorithm ideal for microarray data. It was shown to build models with high accuracy when tested on high dimensional microarray data sets. Current implementations of random forest in the machine learning and statistics community, however, limit its usability for mining over large datasets as they require that the entire dataset remains permanently in memory. We propose a new framework, an optimized implementation of a random forest classifier, which addresses specific properties of microarray data, takes computational complexity of a decision tree algorithm into consideration, and shows excellent computing performance while preserving predictive accuracy. The implementation is based on reducing overlapping computations and eliminating dependency on the size of main memory. The implementation's excellent computational performance makes the algorithm useful for interactive data analyses and data mining.

Synchronized counting method

A direct extension to the counting method is presented which can deal efficiently with both acycl... more A direct extension to the counting method is presented which can deal efficiently with both acyclic and cyclic relations. The extension to cycle cases, called the synchronized counting method, is simulated and studied using a Petri net model. Worst-case analysis shows that n2 semijoin operations are required, where n is the number of nodes in the graph representing the relevant

Genomic pathways database and biological data management

Animal Genetics, Aug 1, 2006

In this paper we discuss the properties of biological data, and challenges it poses for data mana... more In this paper we discuss the properties of biological data, and challenges it poses for data management, and argue that, in order to meet the data management requirements for "digital biology", careful integration of the existing technologies and the development of new data management techniques for biological data are needed. Based on this premise, we present PathCase: Case Pathways Database System. PathCase is an integrated set of software tools for modeling, storing, analyzing, visualizing, and querying biological pathways data at different levels of genetic, molecular, biochemical and organismal detail. The novel features of the system include: (a) genomic information integrated with other biological data and presented starting from pathways, (b) design for biologists who are possibly unfamiliar with genomics, but whose research is essential for annotating gene and genome sequences with biological functions, (c) database design, implementation and graphical tools which enable users to visualize pathways data in multiple abstraction levels, and to pose exploratory queries, (d) a wide range of different types of queries including, "path" and "neighborhood queries", and graphical visualization of query outputs, and, (e) an implementation that allows for web(XML)-based dissemination of query outputs (i.e., pathways data in BIOPAX format) to researchers in the community, giving them control on the use of pathways data.

As more RDF data management systems and RDF data querying techniques emerge, RDF benchmarks provi... more As more RDF data management systems and RDF data querying techniques emerge, RDF benchmarks providing a controllable and comparable testing environment for applications are needed. To address the needs of diverse applications, we propose an application-specific framework, called RBench, to generate RDF benchmarks. RBench takes an RDF dataset from any application as a template, and generates a set of synthetic datasets with similar characteristics including graph structure and literal labels, for the required "size scaling factor" and the "degree scaling factor". RBench analyzes several features from the given RDF dataset, and uses them to reconstruct the new benchmark graph. A flexible query load generation process is then proposed according to the design of RBench. Efficiency and usability of RBench are demonstrated via experimental results.

A graph query language and its query processing

Many new database applications involve querying of graph data. In this paper, we present an objec... more

Topic-Centric Querying of Web Information Resources

Springer eBooks, 2001

This paper deals with the problem of modeling web information resources using expert knowledge an... more This paper deals with the problem of modeling web information resources using expert knowledge and personalized user information, and querying them in terms of topics and topic relationships. We propose a model for web information resources, and a query language SQL-TC (Topic-Centric SQL) to query the model. The model is composed of web-based information resources (XML or HTML documents on the web), expert advice repositories (domain-expert-specified metadata for information resources), and personalized information about users (captured as user profiles, that indicate users' preferences as to which expert advice they would like to follow, and which to ignore, etc). The query language SQL-TC makes use of the metadata information provided in expert advice repositories and embedded in information resources, and employs user preferences to further refine the query output. Query output objects/tuples are ranked with respect to the (expert-judged and userpreference-revised) importance values of requested topics/metalinks, and the query output is limited by either top n-ranked objects/tuples, or objects/tuples with importance values above a given threshold, or both.

Research in Time-and Error-Constrained Database Query Processing