Perspectives on Data Science for Software Engineering presents the best practices of seasoned data miners in software engineering. The idea for this book was created during the 2014 conference at Dagstuhl, an invitation-only gathering of leading computer scientists who meet to identify and discuss cutting-edge informatics topics.

At the 2014 conference, the concept of how to transfer the knowledge of experts from seasoned software engineers and data scientists to newcomers in the field highlighted many discussions. While there are many books covering data mining and software engineering basics, they present only the fundamentals and lack the perspective that comes from real-world experience. This book offers unique insights into the wisdom of the community's leaders gathered to share hard-won lessons from the trenches.

Ideas are presented in digestible chapters designed to be applicable across many domains. Topics included cover data collection, data sharing, data mining, and how to utilize these techniques in successful software projects. Newcomers to software engineering data science will learn the tips and tricks of the trade, while more experienced data scientists will benefit from war stories that show what traps to avoid.

Presents the wisdom of community experts, derived from a summit on software analytics
Provides contributed chapters that share discrete ideas and technique from the trenches
Covers top areas of concern, including mining security and social data, data visualization, and cloud-based data
Presented in clear chapters designed to be applicable across many domains

Frequently asked questions

Simply head over to the account section in settings and click on “Cancel Subscription” - it’s as simple as that. After you cancel, your membership will stay active for the remainder of the time you’ve paid for. Learn more here.

At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.

Both plans give you full access to the library and all of Perlego’s features. The only differences are the price and subscription period: With the annual plan you’ll save around 30% compared to 12 months on the monthly plan.

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.

Yes, you can access Perspectives on Data Science for Software Engineering by Tim Menzies,Laurie Williams,Thomas Zimmermann in PDF and/or ePUB format, as well as other popular books in Computer Science & Databases. We have over one million books available in our catalogue for you to explore.

Information

Publisher

Morgan Kaufmann

Year

2016

ISBN

9780128042618

Topic

Computer Science

Subtopic

Databases

Index

Computer Science

Wisdom

Log it all?

G.C. Murphy University of British Columbia, Vancouver, BC, Canada

Abstract

Imagine a group of blind women are asked to describe what an elephant looks like by touching different parts of the elephant (Wikipedia, Blind men and an elephant). One blind woman touches the trunk and says it is like a tree branch. Another blind woman touches the leg and says it is like a column. The third and fourth blind women touch the ear and tail, saying the elephant is like a fan and a rope, respectively.

Keywords

Integrated development environment (IDE); Monitoring tools; Data capture; Phenomenon of interest

A Parable: The Blind Woman and an Elephant

Imagine a group of blind women are asked to describe what an elephant looks like by touching different parts of the elephant [1]. One blind woman touches the trunk and says it is like a tree branch. Another blind woman touches the leg and says it is like a column. The third and fourth blind women touch the ear and tail, saying the elephant is like a fan and a rope, respectively.

Each of these women believes she is right, and, from her own perspective, each is right. But, each decision is being made on only partial data and tells only part of the entire story.

Misinterpreting Phenomenon in Software Engineering

When we collect data to investigate a software engineering phenomenon, we have to take care not to fall into the trap of taking a single perspective. Let’s consider that we want to know more about the phenomenon of how software developers learn commands available in an integrated development environment (IDE). Understanding what commands are being used, and how and when new commands are learned, might help to develop tools to suggest other useful commands developers are not already using as a means of improving the developer’s productivity (eg, Ref. [2]).

To investigate this phenomenon, we decide to monitor each and every command the developer executes as she works. As part of the logged information, we capture the time at which the command was executed, the command itself, and how the command was invoked. We try out the monitoring tool with those close to us and the data all seems to make sense. We then deploy a monitoring tool and start gathering the information from as many developers as we can convince to install the monitor.

As we gather the logged command data from (hopefully thousands of) developers and begin the analysis, we realize that there are many different subsequences of command usage we did not initially anticipate. For example, we may be surprised to see searches for references to uses of methods interspersed with debugging commands, such as step forward.

It is only at this point in our analysis when we realize that we are missing data about the context in which commands are executed, such as which windows were visible on the screen in the IDE when the command was executed. Had we captured this information, we could infer from the visible windows whether a command is likely being executed while debugging or while writing code. Understanding more about this context could enable us to build better recommender tools.

We had inadvertently fallen into the trap of looking at the phenomenon from only one perspective.

Using Data to Expand Perspectives

As the example demonstrates, capturing more data and making it available for analysis can help us expand our perspective on a software engineering phenomenon.

Imagine if we were now able to not just capture the windows visible when a command is executed, but we also knew which task on which the developer was working, such as the particular feature being developed. With this data, we might be able to start understanding which commands are most applicable for different kinds of tasks. And what if we could capture the developer’s face as the tasks were performed? This data might help identify which commands require more concentration than others to execute and might change the sequence or situations in which a tool recommends particular commands to learn.

Although we might want to capture many kinds of data to understand a phenomenon, is more data always better? In determining what data to capture, we must consider a number of issues:

• Can the data be captured practically? Although we might want to capture video of a developer’s face using an IDE, and although cameras for web conferencing are becoming more readily available, it is not practical to capture this data for a large number of developers working in a variety of environments.

• Will the developers accept the capture and use of the data? Not every developer is likely to feel comfortable with video of their face being captured as it raises many privacy concerns.

• What are the constraints on transmission and sharing of the data? Some data that might be captured might need to be protected because transmission or sharing of that data with the researchers might comprise intellectual property. In some cases, it may be possible to transform the data in such a way that alleviates the concerns about transmission and sharing. In other cases, there may not be a meaningful transform. Researchers must carefully consider how to ensure the data meets any constraints of the environment in which it is collected.

• Can the data be captured at a cost proportional to its likely value? Not all data can be captured at a reasonable cost to either the developer, or the user, or both. The potential benefits of how the data might influence the study of the software engineering phenomenon must be weighed against its collection costs.

Recommendations

We have described that collecting more data can help lead to a more accurate interpretation of a phenomenon of interest. We have also described that there are constraints on simply gathering data about everything happening in a situation of interest.

How can a researcher decide what data to collect, when? Here are some recommendations:

1. Decide on the minimal data you need to make progress on the problem of interest. Make sure you can collect at least this data.

2. Consider what other data it is feasible to collect and try to collect any and all data that is practical and cost-effective to collect and that will not violate privacy or security concerns of those in the environment in which the data will be collected.

3. Think through the analysis and/or techniques and tools driving the collection of the data. Ask questions about how whether the data intended to be collected is sufficient for the analysis or technique and tool creation.

4. Think about the threats to any analysis and/or techniques and tools driving the collection of the data. How might the data mislead?

5. Iterate across each step until you are certain that you have the right balance between data collection concerns and ensuring as accurate a view of a phenomenon as possible.

Above all, make sure your data doesn’t lead you to miss the elephant for one of its parts.

Why provenance matters

M.W. Godfrey University of Waterloo, Waterloo, ON, Canada

Abstract

We are increasingly seeking to extract more and more information from our software development artifacts to infer high-level understanding of our products and the processes that create them. However, this begs many questions about the artifacts, their inter-relatedness, their histories, and the quality of the information that can be extracted. That is, the provenance of the artifacts must be studied to be able to answer these kinds of questions; this chapter explores the notion of software artifact provenance.

Keywords

Provenance; Key entities; Defining and scoping the entities of interest; Artifact linkage and ground truth; Scalable matching algorithms

Here’s a problem: You’re a lead developer for a company that produces a web content management system, and you get email from an open source project leader who claims that your closed source project probably contains some of their open sou...

Cover image
Title page
Table of Contents
Copyright
Contributors
Acknowledgments
Introduction
Success Stories/Applications
Techniques
Wisdom