Friday, August 1, 2014

Retrospective on SIGMOD 2014

I just finished serving as the PC Chair of the SIGMOD 2014 conference. We implemented a number of changes this year, so I thought it would be worthwhile to document our experiences and my thoughts. The good news is that SIGMOD continues to experiment with the structure of the conference and the SIGMOD Executive continues to be very supportive of these efforts. Hopefully, the conference continues to improve as a result.
In addition to the information I provide here, the conference site has more details.

PC Organization

As in previous years, we had a number of area coordinators to assist with the evaluation of submitted papers. The 13 area coordinators (in alphabetical order) were:
  • Lei Chen (Hong Kong University of Science and Technology, China): Graph management, RDF, and social networks
  • Michael Franklin (University of California, Berkeley, USA): Storage, indexing, and physical database design
  • Alfons Kemper (Technical University of Munich, Germany): Query processing and optimization
  • Laks Lakshmanan (University of British Columbia, Canada): Knowledge discovery, clustering, data mining
  • Ioana Manolescu (Inria Saclay, France): Text databases, XML, keyword search
  • Tova Milo (Tel Aviv University, Israel): Database models, uncertainty, schema matching, data integration, crowd sourcing
  • Elke Rundensteiner (Worcester Polytechnic Institute, USA): Streams, sensor networks, complex event processing
  • Ken Salem (University of Waterloo, Canada): System, performance, transaction processing
  • Dennis Shasha (New York University, USA): Anything that does not fit these areas and papers with which area coordinators have a conflict
  • Divesh Srivastava (AT&T Labs-Research, USA): Spatial, temporal, multimedia and scientific databases
  • Kian-Lee Tan (National University of Singapore, Singapore): Aggregation, data warehouses, OLAP, analytics
  • Patrick Valduriez (Inria Sophia-Antipolis Méditerranée, France): Cloud computing, MapReduce parallel/distributed data management, P2P systems
  • Xiaokui Xiao (Nanyang Technological University, Singapore): Security, privacy, authenticated query processing
The program committee consisted of 126 people -- the list is too long to include here; it is on the conference web site.

Review Process

A major change that we introduced this year was to have two submission cycles: the paper submission deadline for the first cycle was September 16, 2013, and for the second it was December 10, 2013. For each cycle, we allocated 7 weeks for paper reviews and 10 days for discussions. Within each cycle, some papers were classified as "revise and resubmit", giving the authors one month to address reviewer comments and submit a revised version. We had about 4 weeks to review these revised papers, followed by a week of discussions. The following figure shows the entire process and the associated numbers (the first number in parentheses is for the first cycle while the second number is for the second cycle).

[Figure: SIGMOD 2014 review process]
Overall we had 419 submissions out of which we accepted 107 (more on this below). In the first cycle, each PC member was assigned 2-4 papers. When I allocated papers in the second cycle, I took into account the revise-and-resubmit papers that each PC member was handling from the first cycle. I made sure no one was assigned more than 11 papers in this cycle (thanks to Natassa Ailamaki for suggesting this).
A number of important guidelines that we followed:
  • All directly accepted papers were considered conditional accepts. Authors were asked to address reviewer comments and submit a new version within about two weeks. The relevant area coordinator and I quickly reviewed the final versions of the papers. The purpose of this was to ensure that authors did not ignore the usually very useful reviewer comments once the paper was accepted. It was my experience in recent years that a significant number of authors treated paper acceptance as final, and totally ignored the reviewer comments that could significantly improve the paper. Given that we treat conference papers as final, archival publications, I wanted to make sure that the papers would be in the best shape possible.
  • As the above figure shows, some papers went into a third round of minor revision. These were usually presentation edits and minor corrections that the reviewers felt were needed to ensure that the paper was of SIGMOD quality. Some reviewers even took the time to annotate the paper with presentation fixes, which we forwarded to the authors. This was a substantial load on the reviewers, over and above what would normally be expected. I very much appreciated the diligence of these reviewers.
  • I asked the reviewers to balance two functions: (1) put together an exciting and broad technical program, and (2) provide meaningful feedback to authors to assist in getting their work published. My point was that as Program Committee members, we do have a "gate keeper" role, but that should be balanced against our responsibility to the community to ensure that worthwhile papers are improved to become publishable. My strong belief is that we do plenty of good work in this community that should see the light of day with proper guidance.
  • As always, I asked the reviewers to provide meaningful reviews. In particular, I asked them to refrain from comments like "This has been done before" or "There are not enough experiments". The first comment requires references to be meaningful, while the second one requires an explanation of what is needed and why. Almost any paper can have more experiments; the question is whether the paper is acceptable without these additional experiments. Another guideline I provided was to be very careful in declaring that a paper is not suitable for SIGMOD; while we need to make sure that the conference is internally consistent, we don't need to be patronizing to the authors who have chosen to submit their work to SIGMOD - we need a balance here. Finally, I asked them to avoid comments like "I am not excited by this paper" -- if we only accepted papers that I am excited about, we may only have a SIGMOD conference once every couple of years. The bottom line: focus on the technical content of the paper and decide whether it advances our understanding.
  • Inevitably, there were a number of sub-par reviews -- very short (some even a single line) and not very informative. I tried to track the reviews as closely as I could and I asked the owners of these sub-par reviews to improve them. To their credit, almost all of them did. However, I did delete a few reviews that were so poor that I decided it was better for the authors not to see them -- I just could not get the reviewers to update them.
  • For revise-and-resubmit papers, I asked the reviewers not to pre-judge what authors can do in the allocated time of one month; I told them to just list what needs to be done for the paper to be acceptable, and leave it to the authors to decide whether or not they can do it in that time. I asked the reviewers to be reasonable in what they ask for -- again, each paper can be improved, but we are not looking for perfect papers, we are looking for acceptable papers.
  • I instructed the reviewers that the evaluation of revised versions should be based only on the requirements explicitly stated to the authors. The point of this is that, normally, we should not be raising new issues that the authors do not have a chance to respond to. This is a fundamental aspect of journal reviewing and I wanted us to follow the same principle. Of course, if we all of a sudden discover a major flaw in the paper while reviewing the revised paper, we can and should reject the paper, but this did not happen.
Basically, we followed a review process that is quite similar to journal reviewing. We could have improved things considerably (more on that below), but I was generally satisfied with the results.

Submission and Acceptance Statistics

Research Paper Track

As I noted above, we had 419 submissions to the research paper track of SIGMOD 2014. Paper submissions to SIGMOD had shown a slight decline in recent years; we managed to arrest that and bring the submission number close to its traditional value of 425-450. The following figure shows the submission numbers and the acceptance ratios over the recent years.

[Figure: SIGMOD submission numbers and acceptance ratios over recent years]

The submissions this year were healthy and manageable. The acceptance rate is at the upper end of what I consider to be the range we should be targeting: 20-25%. Incidentally, Program Committee members asked me repeatedly at the beginning of the process what our quota was and where we were with respect to that quota. My response was that I did not want them to worry about or focus on a quota, that they should simply focus on each paper and decide whether it was acceptable, and that we would find a way to fit the accepted papers into a program. Furthermore, since we used a multi-cycle submission process, it was not possible to do very detailed planning anyway. In the end, we accepted 30 papers more than last year and we were able to accommodate all of them by reducing the presentation time to 25 minutes (including the Q&A session).
As always, the distribution of the papers across the 13 areas was not uniform. The following figure shows the distribution of submissions based on the first area that the authors indicated. We managed this skewed distribution by being flexible in assigning papers to area coordinators, each of whom is a researcher who could handle more than one area.

[Figure: SIGMOD 2014 submissions by area]

I did not analyze the geographic distribution of the submitted papers, but the distribution of accepted papers was as follows: USA 56; China 11; Switzerland and Singapore 7 each; Hong Kong 6; Germany 4; India and Japan 3 each; Israel and UK 2 each; Australia, Austria, Canada, France, Italy, and Korea 1 each.
Finally, I looked at some paper-specific statistics. The following figure shows the distribution of the number of authors of the papers as well as the number of countries and institutions represented. These are statistics for accepted papers.

[Figure: SIGMOD 2014 distribution of authors, countries, and institutions per accepted paper]

It is not surprising that there were no single-authored papers -- there almost never are at SIGMOD. Most of the papers have 3-4 authors. The following table provides the mean and median numbers for these.
[Table: SIGMOD 2014 mean and median numbers of authors, countries, and institutions per accepted paper]

Industrial Paper Track

The Industrial Program track was chaired by Fatma Özcan (IBM Almaden) and Nesime Tatbul (Intel Labs & MIT). They were assisted by 13 PC members. This track received 44 submissions, out of which 15 were accepted, resulting in an acceptance rate of 34%. The accepted industrial papers were also treated as conditional accepts and were shepherded. The distribution of the papers to areas as well as the final decisions are shown in the following figure.

[Figure: SIGMOD 2014 industrial track submissions and decisions by area]

Rounding Out the Technical Program

The Technical Program consisted of 107 research and 15 industrial papers, two keynotes, two panels, and four tutorials. The research and industrial papers were presented in poster sessions over two evenings. This year PODS also included their papers in the poster sessions.
  • Keynotes were selected by Gustavo Alonso. The two keynotes, "How I Learned to Stop Worrying and Love Compilers" by Eric Sedlar of Oracle Labs and "Fun with Hardware Transactional Memory" by Maurice Herlihy of Brown University, were outstanding and I heard nothing but good comments.
  • Panel Chairs were Susan Davidson and Sunita Sarawagi. They organized a panel on Should we all be teaching “Intro to Data Science” instead of “Intro to Databases”. In addition, Fatma and Nesime organized an industrial panel on Are We Experiencing a Big Data Bubble?
  • Tutorials: Chris Jermaine and Yufei Tao, assisted by 8 PC members, selected four tutorials out of 12 submissions.
  • Demonstration Chairs were Bettina Kemme and Wolfgang Lehner. With the assistance of 53 PC members, they selected 29 demonstrations out of 75 submissions. Demonstrations were grouped into three sessions and each was repeated twice. They also organized the selection of the best demo in each group.
  • Undergraduate Research Program was chaired by Mario Nascimento and Anastasios Kementsietsidis. Out of 18 submissions, they selected 7 for poster presentation.
  • As usual, we had a New Researcher Symposium that was chaired by Alexandra Meliou and Anish Das Sarma.
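
The acceptance figures quoted in this post can be rechecked with a few lines of Python. This is only an illustrative sketch: the `tracks` dictionary and the `acceptance_rate` helper are hypothetical names introduced here, and the counts are simply the ones reported above for each track.

```python
# Submission and acceptance counts as reported in this post:
# track name -> (submitted, accepted).
tracks = {
    "Research papers": (419, 107),
    "Industrial papers": (44, 15),
    "Tutorials": (12, 4),
    "Demonstrations": (75, 29),
    "Undergraduate research": (18, 7),
}


def acceptance_rate(submitted, accepted):
    """Return the acceptance rate as a percentage of submissions."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    return 100.0 * accepted / submitted


for name, (submitted, accepted) in tracks.items():
    rate = acceptance_rate(submitted, accepted)
    print(f"{name}: {accepted}/{submitted} = {rate:.1f}%")
```

For the research track this works out to roughly 25.5%, in line with the "upper end of 20-25%" remark, and for the industrial track to roughly 34%.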

What Worked and What I Would Do Differently

With all the experimentation, I think it is a good idea to document what worked and what I would do differently if I were to do it again. I think the following worked very well:
  • Double-blind is working very well and we should maintain it. There were only two cases where authors wondered how to position the paper without revealing their previous work, and we were able to handle these easily. Our community has now accepted and adjusted to double-blind, and it is working well.
  • Considering accepted papers as conditional accepts worked very well -- it added a bit more work for the authors (about two weeks), the area coordinators, and me, but the resulting papers were in much better shape.
  • Having two submission cycles was a great idea. It gave everyone a chance to get the papers into a more reasonable shape for submission. I am convinced that it played a significant role in the increase of paper submissions. I would actually add a third cycle. However, we should recognize that the process is now spread over a longer period of time.
If I were to do this again, here are some changes I would make:
  • I would give PC members three options for paper decisions: Accept, Reject, Revise-and-resubmit. In the end, we wish to classify the papers into these categories anyway, and having too many categories (Strong Accept-Accept-Weak Accept-Weak Reject-Reject-Strong Reject) is not helpful. PC members do not use the full spectrum anyway; a large majority of the papers are categorized as Weak Accept or Weak Reject, so these papers "in the middle" form a large equivalence set, and we spend a ton of time trying to sort these out. Here are some statistics from the first cycle that demonstrate the point:
    • Number of Strong Accept reviews (out of ~300 reviews): 1
    • Number of Accept reviews (out of ~300 reviews): 21
    • Number of papers with at least one Accept/Strong Accept: 20
    • PC members who have rejected every paper in their batch: 8
  • I would reduce review time from 7 weeks to 5 and increase the discussion time to 4 weeks. I have two reasons for suggesting this:
    • PC members really fall into two categories: those who do their reviews very early in the process, and those who procrastinate forever. In the second cycle, only 60% of the reviews were submitted one week before the deadline. In case you think that people were doing their reviews and were uploading at the last minute, I would note that only 80% of the reviews were in when the deadline passed. It took over a week into the discussion period (and many emails) for us to get all the reviews. It appears to me that shortening the period will not have a major impact on the behaviour of either of these groups.
    • An extended discussion period is useful not only for the reviewers to have fuller discussions, but, more importantly, it gives the PC chair more time to go over the reviews and address deficiencies. I tried to read as many of the reviews as time permitted and asked colleagues to improve their reviews. I also participated in the discussions on some papers. However, time was an issue and a longer discussion period would have allowed me to be more engaged.
  • One thing that is not working well is online discussions. Some PC members consider their job done when they submit their reviews and no amount of encouragement would get them to participate in the discussion. I am not sure what to do about this, but it is an issue that we need to address. Right now, online discussions are not doing the job. Perhaps it would be a good idea to have a face-to-face meeting of the area coordinators; that would be an improvement.

In the end...

it was a very enjoyable experience. I have now served as PC Chair of all three major database conferences (VLDB in 2004, ICDE in 2007, and SIGMOD in 2014), and each one is very different. I tried something new in each of these, and some ideas were worthwhile while others were improved upon by others.

Friday, December 27, 2013

ACM Books to Launch | December 2013 | Communications of the ACM

ACM has launched a book program and I have agreed to be the Founding Editor-in-Chief. Read more about the series in this CACM editorial. You can reach the home page of the series by clicking ACM Books. Suggestions for books are very welcome.

Tuesday, July 17, 2012

Computer science publication culture: where to go from here?

I have written a blog post for ACM SIGMOD on the computer science publication culture. The post is here. My main thesis is that in the long run, we will follow other science and engineering disciplines and start treating journals as the main outlet for disseminating our research results. I outline some of the steps that we can take in getting there from where we are today. I would love to hear opinions either here or at the ACM blog.

Friday, March 4, 2011

Principles of Distributed Databases - Third edition is out, finally!...

The third edition is finally out... It has been ten years since the release of the second edition -- it took a while, but we are very happy with the results. We actually started the revision back in 2005 hoping to finish it by 2006, but, as usual, the plans met the reality of many other commitments on both of our parts.

The book is almost a complete re-write. We kept the fundamental principles that have been there since the first edition, but they are updated. The end result is a book that has been heavily revised -- while we maintained and updated the core chapters, we have also added new ones. The major changes are the following:
  1. Database integration and querying is now treated in much more detail, reflecting the attention these topics have received in the community in the past decade. There is one chapter that focuses on the integration process, while another chapter discusses querying over multidatabase systems.
  2. The previous editions had only a brief discussion of data replication protocols. This topic is now covered in a separate chapter where we provide an in-depth discussion of the protocols and how they can be integrated with transaction management.
  3. Peer-to-peer data management is discussed in depth. These systems have become an important and interesting architectural alternative to classical distributed database systems. Although the early distributed database system architectures followed the peer-to-peer paradigm, the modern incarnation of these systems has fundamentally different characteristics, so they deserve in-depth discussion in a chapter of their own.
  4. Web data management is covered in one chapter of its own. This is a difficult topic to cover since there is no unifying framework. We discuss various aspects of the topic ranging from web models to search engines to distributed XML processing.
  5. Earlier editions contained a chapter where we discussed "recent issues" at the time. In this edition, we again have a similar chapter where we cover stream data management and cloud computing. These topics are still in flux and are subjects of considerable ongoing research. We highlight the issues and the potential research directions.
The resulting manuscript strikes a balance between our two objectives, namely to address new and emerging issues, and maintain the main characteristics of the book in addressing the principles of distributed data management.

The third edition is coming out at a time when there is renewed interest in distributed data management. The last ten years have seen an accelerated investigation of distributed data management technologies spurred by the advent of high-speed networks, fast commodity hardware, very heavy parallelization of hardware, and, of course, the increasing pervasiveness of the web. Patrick and I are holding a panel session at the upcoming ICDE 2011 conference on this topic. The objective is to discuss what is likely to happen in the next decade; or to put it differently, if there were to be a fourth edition of our book in 2020, what would it be? What would be new? We'll see what emerges as the important trends. I'll report.

The book is available from Springer, Barnes & Noble, Chapters-Indigo (in Canada), and, of course, Amazon. The Springer site will (eventually) have presentation slides and solutions to selected exercises -- we are working on them right now.

Sunday, February 6, 2011

J.C.R. Licklider and the early days of computing

I just finished reading The Dream Machine by M. Mitchell Waldrop (not the 1991 movie...). It is a biography of J. C. R. Licklider, but it is much more than that - it is the story of the very early days of computing in the US starting in the 1950s. J. C. R. Licklider, or Lick as he apparently preferred to be called, started his career at Harvard in the Psycho-Acoustic Laboratory in 1943 after receiving his PhD at the University of Rochester on that very topic. During his time at Harvard, he started attending the famous "supper seminars" organized by Norbert Wiener (who was a distinguished mathematician and is the father of the cybernetics movement). One of the problems debated at these seminars was the relationship between digital computers and the human brain. Thus started Lick's interest in computing, which shaped the rest of his life. In 1950 he moved to MIT with the promise of setting up a cognitive psychology research program and a department of psychology. He did set up a top-notch and influential program, but he could not realize the objective of setting up a department due to institutional obstruction. He moved to BBN in mid-1957 as Vice-President in charge of all psycho-acoustics research. He moved to ARPA in 1962 to head the Information Processing Techniques Office (IPTO) where he stayed until 1964. He then moved to IBM for a short while and then returned to MIT in 1968 from where he retired in 1985. He passed away in 1990.

His academic career, as it relates to computing, is very interesting and it is eye-opening to read some of his papers. After I finished the book, I read his 1960 paper "Man-Computer Symbiosis" and his 1968 paper "The Computer as a Communication Device", co-authored with Bob Taylor (who himself became the head of IPTO later on, and is one of the fathers of the ARPANET). Both papers were included in a 1990 DEC Technical Report in memory of Lick shortly after he passed away. His vision of where computing should go is very enlightening when considered in historical perspective, in particular his emphasis on moving from a computing paradigm based on well-defined specification (and coding) of a solution supported by batch processing to one where the system "works" with the users and "learns" along the way, and is supported by timesharing (and later interactive) computing.

Lick's ARPA days were perhaps far more influential on the growth of computing in the US. He was influential in initiating and funding projects at a few key institutions on timesharing (Project MAC at MIT, and Ed Feigenbaum at UC Berkeley), AI (again Project MAC and Marvin Minsky at MIT, Allen Newell, Herbert Simon, and Alan Perlis's work at CMU, John McCarthy's work at Stanford), human-computer interaction (Doug Engelbart's group at SRI), and he started the work on the ARPANET. He had explained his ideas of an "intergalactic computer network" in a series of memos in 1961 while he was at BBN. These ideas are also summarized in the 1968 essay "The Computer as a Communication Device". The book is very well researched and very nicely written. It is Lick's life that forms the backbone of the book, but that is not constraining at all given Lick's impact on so many areas. The projects that he funded are very well described. The projects and efforts that grew out of these early projects (such as Xerox PARC) are also included to complete the narrative.

When I completed the book, I kept thinking that the new generation of students should be exposed to the history of computing in some way. There is significant value in being able to see the thread of ideas from their early germination to their later realization (sometimes decades later). I believe it would be better to weave the discussion of history into the discussion of fundamental techniques and algorithms. This requires a rethinking of how we introduce computer science -- especially in the early courses -- but that is a topic for another post.