Research Interests
- Online Crowdsourcing
- Econ-based Data Mining
- Web-based Behavioral Experiments
- Data Cleaning and Data Integration
Current Projects
- Managing Crowdsourcing Workers
The emergence of online crowdsourcing services (like Amazon's Mechanical Turk) presents huge opportunities to distribute micro-tasks at an unprecedented rate and scale. Unfortunately, inaccurate verification mechanisms and unstable employment relationships give rise to opportunistic behavior by workers, which in turn exposes requesters to quality risks. We present an algorithm that uses redundancy to estimate the quality of the workers and that can easily separate the true (unrecoverable) error rate from the (recoverable) biases. The algorithm can also seamlessly incorporate "gold" data, when it exists, for learning the quality of workers. In addition, we are developing an active learning approach for testing worker quality using gold labels or imperfect labels.
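The idea of estimating worker quality through redundancy can be illustrated with a minimal sketch. This is a simplified, single-pass illustration and not the algorithm from the project: it assumes majority vote as a provisional reference label, lets any available gold labels override it, and builds a row-normalized confusion matrix per worker. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_labels(votes):
    """votes: item -> {worker: label}. Returns item -> most frequent label."""
    return {
        item: Counter(by_worker.values()).most_common(1)[0][0]
        for item, by_worker in votes.items()
    }


def confusion_matrices(votes, reference):
    """Per worker, estimate P(assigned label | reference label)."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for item, by_worker in votes.items():
        for worker, label in by_worker.items():
            counts[worker][reference[item]][label] += 1
    return {
        worker: {
            ref: {lab: n / sum(row.values()) for lab, n in row.items()}
            for ref, row in rows.items()
        }
        for worker, rows in counts.items()
    }


# Toy data (hypothetical): workers a, b, c agree; "flip" always inverts.
votes = {
    "t1": {"a": 1, "b": 1, "c": 1, "flip": 0},
    "t2": {"a": 0, "b": 0, "c": 0, "flip": 1},
    "t3": {"a": 1, "b": 1, "c": 1, "flip": 0},
    "t4": {"a": 0, "b": 0, "c": 0, "flip": 1},
}
gold = {"t1": 1}  # assumed gold label; overrides the majority vote for t1
reference = {**majority_labels(votes), **gold}
matrices = confusion_matrices(votes, reference)
```

In this toy example the "flip" worker disagrees with the reference on every item, yet the confusion matrix is a clean permutation, so the labels can be corrected by inverting them. That is the sense in which a systematic bias is recoverable, whereas a worker whose matrix rows are identical carries no information about the true label.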
- An Experimental Study of Cooperation in Evolving Social Networks
We study how cooperation emerges in a dynamic network environment where individuals are able to choose their actions and change whom they interact with.
- Paid and Unpaid Reviews
We plan to examine whether "paid" user‐generated content differs in its characteristics from content generated by "unpaid" volunteers. If so, are consumers aware of the difference? The experiment is currently running on Amazon Mechanical Turk.
Past Projects
- Estimating the Completion Time of Crowdsourced Tasks
We model the completion time as a stochastic process and build a statistical method for predicting the expected time for task completion. We use a survival analysis model based on Cox proportional hazards regression. Our model shows how time-independent variables of posted tasks (e.g., type of the task, price of the HIT, day posted, etc.) affect completion time.
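For reference, the standard form of the Cox proportional hazards model is shown below. The symbols are introduced here for illustration and are not taken from the paper: $x$ is assumed to be the vector of time-independent task covariates (task type, HIT price, day posted), $h_0(t)$ the unspecified baseline hazard, and $\beta$ the regression coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)
```

Under this form, $\exp(\beta_j)$ is the hazard ratio associated with a one-unit increase in covariate $x_j$; in this setting a ratio above 1 would correspond to tasks being completed at a faster rate.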
- Continuing work on Sheng et al.'s paper "Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers". With low-cost labeling driven by online micro-outsourcing systems such as Amazon's Mechanical Turk, it is possible to improve both data quality and model quality by repeated labeling within a budgeted cost. In previous work, several selective strategies performed very well on data accuracy. Recently, we studied a novel label uncertainty estimation strategy (NLU) that takes the underlying class distribution into account. We also updated the old LMU strategy by integrating NLU with MU. Moreover, we attempted to improve the classification accuracy in two ways:
- Soft Labeling. Although the old majority voting worked well, the uncertainty differed across instances. Soft labeling assigns two copies (one positive, one negative) to each instance, with weights equal to the certainty score of the particular selective repeated-labeling strategy.
- Weighted Sampling. In the previous study, each instance was chosen with absolute priority. Here, the probability of being chosen is proportional to the uncertainty score of each instance under the particular selective strategy.
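The two ideas above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the certainty score is assumed to lie in [0, 1], and giving the negative copy the complementary weight is an assumption made here, not a detail stated in the project description.

```python
import random


def soft_label_copies(features, certainty_positive):
    """Return two weighted training copies (features, label, weight).

    Assumes certainty_positive is in [0, 1]; the negative copy is given
    the complementary weight (an assumption for this sketch).
    """
    if not 0.0 <= certainty_positive <= 1.0:
        raise ValueError("certainty_positive must be in [0, 1]")
    return [
        (features, 1, certainty_positive),
        (features, 0, 1.0 - certainty_positive),
    ]


def pick_instance(uncertainty, rng):
    """Pick an instance id with probability proportional to its score.

    uncertainty: instance id -> non-negative uncertainty score.
    rng: a random.Random instance, passed in for reproducibility.
    """
    ids = list(uncertainty)
    weights = [uncertainty[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


copies = soft_label_copies((0.2, 0.5), 0.8)
chosen = pick_instance({"x": 0.0, "y": 1.0}, random.Random(0))
```

The weighted copies can be passed to any learner that accepts per-example weights. Under weighted sampling, an instance with zero uncertainty is never selected for relabeling, while the others compete in proportion to their scores rather than by strict ranking.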
- Exploring the negative correlation between labelers. In most cases we assume that labelers are independent, but this assumption is neither true nor helpful for achieving higher accuracy. Inferring the correlation between labelers presents both opportunities and challenges.
- Benefit Side. Negative correlation between labelers (as long as their quality is better than random) can give us better results than pure independence.
- Cost Side. Better results appear when labeler quality is high, the number of labelers is large, and the number of examples per labeler is relatively small, conditions that are unrelated to the true correlation.
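The benefit of negative correlation can be checked with a small exact calculation. The sketch below is an assumption-laden toy case, not the analysis from the project: three labelers who are each correct 70% of the time vote by majority, once with independent errors and once with an extreme, hypothetical joint distribution in which no two labelers ever err on the same example.

```python
from itertools import product


def majority_accuracy(joint):
    """joint: tuple of per-labeler correctness flags -> probability.

    Returns the probability that a strict majority of labelers is correct.
    """
    if abs(sum(joint.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p for flags, p in joint.items() if 2 * sum(flags) > len(flags))


def independent_joint(accuracy, n):
    """Joint distribution for n independent labelers with equal accuracy."""
    joint = {}
    for flags in product([True, False], repeat=n):
        p = 1.0
        for ok in flags:
            p *= accuracy if ok else 1.0 - accuracy
        joint[flags] = p
    return joint


independent = independent_joint(0.7, 3)

# Hypothetical negatively correlated case: errors never coincide, yet each
# labeler is still correct with probability 0.1 + 0.3 + 0.3 = 0.7.
negative = {
    (True, True, True): 0.1,
    (False, True, True): 0.3,
    (True, False, True): 0.3,
    (True, True, False): 0.3,
}

acc_independent = majority_accuracy(independent)  # about 0.784
acc_negative = majority_accuracy(negative)        # 1.0 up to rounding
```

Individual accuracy is identical in both cases; only the dependence between errors changes. Because at most one labeler is wrong on any example in the second case, the majority vote is always correct.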
- The relationship between the payment to turkers and the outcomes. I designed three experiments: judging the emotion of the person in an image [See an example], rating movie reviews [See an example], and solving chess puzzles [See an example]. The results confirm the claim that higher payment induces better performance. They also show that turkers complete online micro-outsourcing work not only for fun and challenge, but also for money.
- Predicting Financial Data [See poster]
This was a course project in Machine Learning.
The dataset consisted of description vectors of various companies, together with a variable that indicated whether each company defaulted on its loans. Given the description of each company without the dependent variable, my goal was to increase the prediction accuracy. The dataset was highly incomplete (with many missing values) and unbalanced, so I employed a 2-layer neural network and a latent variable inference model to do the classification.
- Designing Approximate Algorithms for k-Anonymity [See paper]
Data-based privacy preservation has become a hot research topic in recent years. One attack method is the linking attack, which joins the published data with other data on some attributes and reveals sensitive information. To protect privacy against this attack, the notion of k-anonymity, which makes each record in the table indistinguishable from at least k-1 other records, has been proposed.
In this work, we proposed a summary itemset based k-anonymity algorithm. Experimental results showed that our algorithm could achieve a similar approximation ratio in shorter running time. We also extended the notion of k-anonymity to sequence data and devised a summarization subsequence based k-anonymity algorithm.
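The k-anonymity property itself is easy to state in code. The sketch below only checks the definition given above (every record shares its quasi-identifier values with at least k-1 other records); it is not the summary itemset based algorithm, and the column names and records are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in records occurs at least k times."""
    if k < 1:
        raise ValueError("k must be at least 1")
    groups = Counter(
        tuple(record[attr] for attr in quasi_identifiers) for record in records
    )
    return all(size >= k for size in groups.values())


# Hypothetical published table with generalized zip code and age range.
table = [
    {"zip": "100**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "100**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "112**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "112**", "age": "30-39", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(table, ["zip", "age"], 2)
three_anonymous = is_k_anonymous(table, ["zip", "age"], 3)
```

Here each (zip, age) combination appears exactly twice, so the table satisfies 2-anonymity but not 3-anonymity with respect to those two attributes.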
Publications
- Jing Wang, Panagiotis G. Ipeirotis, Foster Provost. Managing Crowdsourcing Workers. The 2011 Winter Conference on Business Intelligence, Utah, March 10-12, 2011.
- Jing Wang, Siamak Faridani, Panagiotis G. Ipeirotis. Estimating the Completion Time of Crowdsourced Tasks Using Survival Analysis Models. Intl. Conf. on Web Search and Data Mining (WSDM) Workshop on Crowdsourcing for Search and Data Mining (CSDM), Hong Kong, China, Feb 9, 2011.
- Panagiotis G. Ipeirotis, Foster Provost, Jing Wang. Quality Management on Amazon Mechanical Turk. Proceedings of the ACM SIGKDD Workshop on Human Computation (HCOMP), 2010.
- Charu C. Aggarwal, Yan Li, Jianyong Wang, Jing Wang. Frequent Pattern Mining with Uncertain Data. Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009, pp. 29-37.
- Li Zheng, Yintao Liu, Jing Wang, Fang Yang. Multiple Standards Compatible Learning Resource Management. ICALT 2008, pp. 657-661.
Working Papers
- Repeated Labeling Using Multiple Noisy Labelers (With Panagiotis G. Ipeirotis, Foster Provost, and Victor Sheng), Under Review
- Managing Crowdsourcing Workers (With Panagiotis G. Ipeirotis and Foster Provost)
- Cooperation and assortativity with dynamic link formation (With Siddharth Suri and Duncan J. Watts)
Awards
- Doctoral Student Fellowship, Leonard N. Stern School of Business, New York University, Aug. 2008-present.
- Scholarship for Excellent Thesis, Dept. of Computer Science & Technology, Tsinghua University, June 2008.
- Tsinghua Outstanding Graduates, Tsinghua University, July 2008
- Kai Feng scholarship for comprehensive excellence, Tsinghua University, Oct. 2007
- First-class comprehensive excellence scholarship, Tsinghua University, Oct. 2006
- National Scholarship, Tsinghua University, Oct. 2005
- First-Level Certificate for National Abacus and Mental Arithmetic Test, 1997