Sachindra Joshi

Hugo Hidalgo

Andrew McCallum

University of Massachusetts Amherst

Heilbronn University

Marcelo Luis Errecalde

UNSL Universidad Nacional de San Luis

Jean-Michel Renders

Rahul Kumar

Papers by Sachindra Joshi

Classification of Text Documents Based on Minimum System Entropy

ICML, 2003

In this paper, we describe a new approach to classification of text documents based on the minimization of system entropy, i.e., the overall uncertainty associated with the joint distribution of words and labels in the collection. The classification algorithm assigns a class label to a new document in such a way that its insertion into the system results in the maximum decrease (or least increase) in system entropy. We provide insights into the minimum system entropy criterion, and establish connections to traditional naive Bayes approaches. Experimental results indicate that the algorithm performs well in terms of classification accuracy. It is less sensitive to feature selection and more scalable when compared with SVM.
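The labeling rule in this abstract can be illustrated with a minimal sketch. It assumes that system entropy is the Shannon entropy of the empirical joint word-label distribution built from raw counts; the paper's exact definition and estimation details may differ. The names `joint_entropy` and `classify` are hypothetical and used only for illustration.

```python
import math
from collections import Counter


def joint_entropy(counts):
    """Shannon entropy (in nats) of the empirical joint word-label distribution.

    `counts` maps each label to a Counter of word frequencies (an assumed layout).
    """
    total = sum(sum(word_counts.values()) for word_counts in counts.values())
    entropy = 0.0
    for word_counts in counts.values():
        for n in word_counts.values():
            if n > 0:
                p = n / total
                entropy -= p * math.log(p)
    return entropy


def classify(counts, doc_words):
    """Pick the label whose insertion of `doc_words` yields the lowest entropy."""
    best_label, best_entropy = None, float("inf")
    for label in counts:
        # Copy the counts so that each trial insertion starts from the same state.
        trial = {c: Counter(wc) for c, wc in counts.items()}
        trial[label].update(doc_words)
        h = joint_entropy(trial)
        if h < best_entropy:
            best_label, best_entropy = label, h
    return best_label


if __name__ == "__main__":
    corpus = {
        "sports": Counter({"ball": 5, "goal": 5}),
        "tech": Counter({"code": 5, "chip": 5}),
    }
    print(classify(corpus, ["ball", "goal"]))
```

Because every candidate label is compared after the same document is inserted, choosing the lowest resulting entropy is equivalent to choosing the maximum decrease (or least increase) relative to the current system.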

Systems and Methods for Standardization and De-Duplication of Addresses Using Taxonomy

E-Mail Thread Hierarchy Detection

Methods, apparatus and computer programs for characterizing web resources

Electronic mail duplicate detection

Methods, apparatus and computer programs for evaluating and using a resilient data representation

Provided are methods, apparatus and computer programs for evaluating the resilience, to structural changes in a data source, of a representative label representing a data element within the data source. Also disclosed are applications using a resilient representative label. For example, a representative label may represent a particular data field or other data element within a semi-structured data source, such as within XML or HTML Web pages. An estimate of resilience to changes can be used to determine whether a candidate ...

Finding Influential Authors in Brand-Page Communities

Enterprises are increasingly using social media forums to engage with their customers online, a phenomenon known as Social Customer Relation Management (Social CRM). In this context, it is important for an enterprise to identify 'influential authors' and engage with them on a priority basis. We present a study towards finding influential authors on Twitter forums where an implicit network based on user interactions is created and analyzed. Furthermore, author profile features and user interaction features are combined in a decision tree classification model for finding influential authors. A novel objective evaluation criterion is used for evaluating various features and modeling techniques. We compare our methods with other approaches that use either only the formal connections or only the author profile features and show a significant improvement in the classification accuracy over these baselines as well as over using Klout score.

Amino acid compositions of different protein fractions in developing grains of NP 113 barley and its high lysine Notch-2 mutant

Plant foods for human nutrition (Dordrecht, Netherlands), 1988

The percent distributions of protein fractions, namely albumin + globulin, prolamine and glutelin, were studied in developing grains of NP 113 barley and its high lysine mutant Notch-2. During development the percentage of albumin + globulin fraction decreased in NP 113, while those of prolamine and glutelin remained unchanged. The increase in prolamine was substantial from 24 to 31 DAA. In Notch-2 the trend followed by albumin + globulin and prolamine was like that in NP 113, while the glutelin fraction showed an increase as compared to 10 DAA. The percent of albumin + globulin was slightly higher in Notch-2 as compared to NP 113. The absolute amount (mg/grain) of all the protein fractions increased during development in both NP 113 and its mutant Notch-2. During the grain development the prolamine content was substantially lower in the mutant than in the parent NP 113. The albumin + globulin content per endosperm was in general also higher in NP 113 than Notch-2. Amino acid analysis ...

Improve control with software monitoring technologies

Multiple linear regression, principal component analysis, partial least squares, polynomial regression and artificial neural networks are popular techniques for process modeling. An industrial case study illustrates some of these technologies, with an emphasis on artificial neural networks. Experience with this and other projects indicates that while neural network models, combined with partial least squares when necessary, are an excellent tool for modeling, linear techniques may also be appropriate in some cases. Regardless of the specific method used, software analyzers are an attractive lower-cost alternative to hardware options in some monitoring applications. From a fundamental point of view, the result of chemical analysis can be considered as the dependent variable(s) of a process system having a number of independent variables. The independent variables are the causes and the chemical analysis is the effect. If the cause-and-effect relationship between the inputs and output were known, then knowledge of the inputs would be sufficient to predict the output values reliably, thus circumventing the problems with the experimental methods. This is the alternate approach to performing chemical analysis. Note that experimental data on the outputs would be required for model development. In some cases, the predicted values may be used as the primary source of analysis. In others, the predicted values may help eliminate time delays associated with the experimental analyzer systems. The predictions may be reinforced by experimental results when they become available. The paper discusses modeling techniques, neural network modeling, and a vinyl acetate application.

Tu1034 Risk of Development of Hepatocellular Carcinoma in Patients With NASH Related Cirrhosis

of HCV, HBV or alcohol use) compared with primary risk factors. Logistic regression was used to examine the association between idiopathic HCC, stage at diagnosis, and treatment receipt. Cox proportional hazards analyses were conducted to assess the effect of idiopathic HCC on the risk of mortality. All analyses were adjusted for demographic and clinical features of HCC. Results: We identified 1200 HCC patients. Approximately 67% had a prior HCV diagnosis, 4% had HBV, and 41% had alcohol use (categories not mutually exclusive) while 22% had idiopathic HCC. A significant increase in the proportion of patients with idiopathic disease was observed over time from 18% in 2004-05 to 35% in 2009-11 (p = 0.03). Among patients with idiopathic HCC, 60% had diabetes or NASH/NAFLD. Patients with idiopathic HCC were significantly less likely to receive HCC surveillance than those with primary risk factors (p < 0.01). Compared to patients with primary risk factors, those with idiopathic HCC were 60% more likely to be diagnosed with advanced HCC (OR=1.6; 95%CI:1.2-2.1) and 40% less likely to receive HCC-specific treatment (OR=0.6; 95%CI:0.4-0.8). However, median survival was similar in both groups (HR=0.9; 95%CI:0.8-1.2). Conclusion: In the United States, almost 25% of HCC cannot be explained by HCV, HBV, or alcohol use. However, 60% of patients with idiopathic HCC had pre-existing diabetes or NASH/NAFLD. Patients with idiopathic HCC were less likely to receive HCC surveillance, were diagnosed at a more advanced stage, and were less likely to receive treatment. Expanding the use of HCC surveillance to those with diabetes or NASH/NAFLD should be considered.

Effective Use of Waste Generated in Thermal Power Plant as a Value Added Product in Cement Manufacturing and Its Environmental Impact

Energy recovery from solid waste in cement rotary kiln and its environmental impact

Mining of generalized disjunctive association rules

Method and apparatus for populating a predefined concept hierarchy or other hierarchical set of classified data items by minimizing system entropy

Mining generalised disjunctive association rules

Proceedings of the tenth international conference on Information and knowledge management - CIKM'01, 2001

Search result summarization and disambiguation via contextual dimensions

Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06, 2006

Topic hierarchies are a popular method of summarizing the results obtained in response to a query in various search applications. However, topic hierarchies are rigid when they are pre-defined and somewhat unintuitive when they are dynamically generated by statistical techniques. In this paper, we propose an alternative approach to query disambiguation and result summarization by placing the results in set

Data Cleansing Techniques for Large Enterprise Datasets

2011 Annual SRII Global Conference, 2011

... K Hima Prasad, Tanveer A Faruquie, Sachindra Joshi, Snigdha Chaturvedi, L Venkata Subramania...

RAD: A Scalable Framework for Annotator Development

2008 IEEE 24th International Conference on Data Engineering, 2008

Developments in semantic search technology have motivated the need for efficient and scalable entity annotation techniques. We demonstrate RAD: a tool for Rapid Annotator Development on a document collection. RAD builds on a recent approach [1] that translates entity annotation rules into equivalent operations on the inverted index of the collection, to directly generate an annotation index (which can be used in search applications). To make the framework scalable, we use an industrial strength indexer, Lucene [2] and introduce some modifications to its API. The index also serves as a suitable representation for making quick comparisons with an indexed ground truth of annotations on the same collection to evaluate precision and recall of the annotations. RAD achieves at least an order of magnitude speedup over the standard approach of annotating a document-at-a-time as adopted by GATE [3]. The speedup factor increases with increase in the size of the collection, making RAD scalable. We cache intermediate results from the index operations, enabling quick update of the annotation index as well as speedy evaluation when rules are modified. This makes RAD suitable for rapid and interactive development of annotators.
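The idea of evaluating an annotation rule on the inverted index, rather than scanning one document at a time, can be sketched as follows. This is a toy positional index in plain Python, not Lucene's API and not RAD's rule language; the rule shown ("token A immediately followed by token B") and the names `build_index` and `annotate_sequence` are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Build a positional inverted index: token -> doc_id -> list of positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.lower().split()):
            index[tok][doc_id].append(pos)
    return index


def annotate_sequence(index, first, second):
    """Return (doc_id, start_pos) wherever `first` is immediately followed by `second`.

    Only the postings of the two tokens are touched; no document text is rescanned.
    """
    hits = []
    postings_a = index.get(first, {})
    postings_b = index.get(second, {})
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        following = set(postings_b[doc_id])
        hits.extend((doc_id, p) for p in postings_a[doc_id] if p + 1 in following)
    return hits


if __name__ == "__main__":
    docs = {1: "Dr Smith met Dr Jones", 2: "Smith met nobody"}
    idx = build_index(docs)
    print(annotate_sequence(idx, "dr", "smith"))
```

The list of hits plays the role of a small annotation index: it records where the rule matched and could itself be stored and queried, or compared against ground-truth annotations.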

Automatic categorization of web sites based on source types

Proceedings of the fifteenth ACM conference on Hypertext and hypermedia, 2004

... Types Shourya Roy IBM India Research Lab Block 1, IIT Delhi, Hauz Khas New Delhi, 110016, IND...
