AKA: OnYourLaptop - Part 3/3 - Web Scraping - gathering data from websites, HTML & JSON Parsing, APIs and gathering Twitter streams
Preexisting clean data sets such as the General Social Survey (GSS) or Census data are readily available, cover long periods of time, and have well-documented codebooks. However, some people want to gather their own data. Tools and techniques for finding and compiling data from webpages, whole websites, or social media sources have recently become more accessible, but they add a different layer of complexity.
Data Cleaning and Analysis
Advanced Excel for Data Projects
Spreadsheets are a standard tool for many data projects, whether because of the ability to easily edit data, the ubiquity of spreadsheet programs, or the added features like charts and filters. This workshop extends the introduction from our Basic Data Cleaning and Analysis for Data Tables workshop by focusing on more advanced features in Excel. Examples include filtering, pivot tables, and data visualizations.
Analysis with R
Explore the basics of the R programming language for statistics and graphing in this introductory workshop. This hands-on workshop covers the basics of getting help, loading, managing, graphing, and analyzing data in R. No previous experience with R is required. Course materials will be available before the class for workshop participants.
Basic Data Cleaning and Analysis for Data Tables
Tables of data, like those you see in spreadsheets or relational databases, are the foundation of most data-driven research today. There are many pitfalls of working with these tables, though, that most people end up having to learn the hard way. In this workshop, we'll take a dataset that has a variety of different properties and learn to work through many common steps of data-driven research to clean and begin analyzing the data. We'll be using Excel to make sure the methods we suggest can be reproduced easily "at home," but many of these techniques are important for other data analysis tools as well. No data experience necessary.
Introduction to Stata
Stata for Research focuses on the core concepts of using Stata. This workshop provides a hands-on overview of how to load, manage, and analyze data using Stata, and includes a brief introduction to Stata graphics. No previous experience with Stata is required.
OpenRefine (Previously Google Refine)
AKA: OnYourLaptop - Part 1/3 - OpenRefine - Data Cleaning, Mining, Transformations, and Text Normalization
OpenRefine (formerly Google Refine) is a tool for working with semi-structured datasets. It lets you explore data, easily find patterns within data using facets, detect data inconsistencies, and apply quick clean-up and transformation options. OpenRefine is an often intuitive but powerful tool for normalizing data before importing the dataset into a presentation application (e.g., for mapping, charting, or analysis). In this hands-on class, we'll explore how OpenRefine can help with common data cleaning challenges.
Regular Expressions (RegEx)
AKA: OnYourLaptop - Part 2/3 - Regular Expressions (RegEx)
Regular Expressions are a powerful method of finding patterns in text. For example: find all words ending in "ing"; all words which begin with a capital letter; all telephone area codes that begin with either the numbers 7 or 8; all email addresses which contain "duke.edu". Many programming languages use regular expressions as a means to support pattern matching.
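The examples above can be sketched in Python with the standard `re` module. This is only an illustrative sketch, not workshop material: the sample text, the variable names, and the assumption that phone numbers are written as NNN-NNN-NNNN are all made up for the example.

```python
import re

# Hypothetical sample text, invented for illustration.
text = (
    "Running and jumping are fun. Contact Alice at alice@duke.edu "
    "or Bob at bob@example.com. Call 704-555-0123, 919-555-0188, or 828-555-0147."
)

# All words ending in "ing".
ing_words = re.findall(r"\b\w+ing\b", text)

# All words that begin with a capital letter.
capitalized = re.findall(r"\b[A-Z]\w*", text)

# Telephone area codes that begin with 7 or 8.
# Assumes numbers are formatted as NNN-NNN-NNNN; the group captures the area code.
area_codes = re.findall(r"\b([78]\d{2})-\d{3}-\d{4}\b", text)

# Email addresses whose domain contains "duke.edu".
duke_emails = re.findall(r"[\w.+-]+@[\w.-]*duke\.edu\b", text)

print(ing_words)
print(capitalized)
print(area_codes)
print(duke_emails)
```

Real-world text is messier than this sample (phone numbers with parentheses, emails with unusual characters), so patterns like these usually need to be adjusted to the data at hand.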
Mapping and GIS
Introduction to ArcGIS
Do you want to find out how geographic information system (GIS) software can aid your research? This class will provide an overview of how ArcGIS software can help you analyze or visualize digital data that has a locational component, as well as discuss starting points for obtaining data. Examples will focus on social science data, but attendees are encouraged to ask questions regarding their own needs and will be welcome to make one-on-one appointments later for more focused instruction.
This class will show some ways that ArcGIS can be used for the analysis and visualization of historical spatial data. Topics discussed will be: sources for GIS layers reflecting the past, georeferencing a scanned historic map, creating new layers from scratch based on known locations of features, editing existing GIS layers to reflect former features and vectorizing a scanned map to create editable features.
Introduction to QGIS
Looking for an open source option for GIS? QGIS is free and it is one alternative to using ArcGIS. In this workshop we will demonstrate how to import and analyze data in QGIS and discuss the benefits of using QGIS over other GIS software.
Adobe Illustrator for Diagrams and Visualizations
Do you ever try to draw simple diagrams for proposals, reports or publications? Have you ever struggled to get a graphing program to make your plots look just right? Adobe Illustrator is a vector graphics editing program which can be very useful for faculty, students and staff in these types of situations, but many people avoid it because of the seemingly steep learning curve. In this workshop I will present a few basic principles of good graphic design, and then run through some simple examples of Illustrator's capabilities, showing you how to start using it to modify your graphs and create diagrams to explain your ideas.
Advanced Tableau (Data Structures)
This workshop will focus on the challenges of using different types and structures of data in Tableau. We will learn how to clean and organize various data sources for Tableau, how to join and blend data to combine datasets, and how to design visualizations when datasets have been joined or blended.
Designing Academic Figures and Posters
Figures and other forms of visual representation can have a huge impact on the communication of research to a broader audience. A well designed figure can summarize research, captivate audience interest, and/or explain complicated phenomena and processes. Likewise, becoming familiar with good strategies for poster design allows researchers to take full advantage of the opportunity to network with colleagues and promote their own research. This workshop will cover basic considerations for designing effective academic figures and posters, including use of color, layout, fonts/typography, and software choices.
Easy Interactive Charts and Maps with Tableau
Tableau Public (available for both Windows and Mac) is free software that allows individuals to quickly create interactive visualizations of their research and business analytics data. This workshop will focus on using Tableau Public to create data visualizations, starting with an overview of the structure of the program and the terminology used. The workshop will include a sample data visualization and mapping project, focusing especially on some of the new features in Tableau Public 9. We will also discuss publishing to the Tableau Public web server and related services and tools, like the full Tableau Desktop application (free for full-time students).
Making Data Visual
The process of making data visual can be nuanced and iterative. Sometimes we start a project with a very specific idea of the kind of visualization we want, but other times we may not be sure what will work best. This workshop will address three important aspects of making data visual: identifying the goal of your visualization, identifying the audience of your visualization, and understanding the pros and cons of different types of visualizations. This workshop will focus not on any specific software application, but instead will focus on helping attendees develop instincts for what kinds of visualizations match well with particular datasets, goals, and audiences. While some mention will be made of non-traditional visualizations, like custom diagrams, the emphasis will be on standard visualization types.
Structuring Humanities Data
Have you ever wondered what medieval scribes, ancient artifacts, historical paintings and Victorian fiction have to do with data? Have you ever thought about how social media data can be used to document and analyze groups, events and moments in history? Digital tools can open up new and exciting possibilities for Humanistic inquiry, as long as you see people, places, dates and relationships as data and know how to "speak" in a way that computers understand. Through a series of case studies, the Structuring Humanities Data workshop will help Humanists see the data in their subjects and provide guidelines for how to structure and gather data in simple spreadsheets, including ways to deal with tricky but common situations like uncertainty in dates. The workshop will also show examples where computers were used to help gather data automatically, and look under the hood at some data driving visualizations on the web.
Data Storage and Management
Data Management Plans: Grants, Strategies and Considerations
Fall 2012 - Spring 2014
In the last few years granting agencies across the disciplines have increasingly required data management plans as part of a grant proposal that detail strategies to manage, share and preserve research data as part of a funded grant project. NSF, the NIH, the National Endowment for the Humanities and other organizations have similar requirements, and Duke policy requires that research records (including digital data) be kept for at least five years. How should researchers respond? In this presentation, we’ll give an overview of research data management challenges and opportunities and describe some approaches for meeting them. We’ll ask the audience to share how they do data management now, and we’ll talk about planning underway for new services to help with data management at Duke.
Data Cleaning and Analysis
Useful R Packages: Extensions for Data Analysis, Management, and Visualization
The basic version of the R programming language provides a powerful tool for data analysis, but much of the value in R lies in the wide range of libraries that extend its basic functionality. This workshop shares a number of popular extensions to R that enable rich graphics (ggplot, Google graphics), file conversions, and additional statistical tests. A basic familiarity with R would be useful for this workshop.
Introduction to Text Analysis
Fall 2012 - Spring 2014
Many research projects involve textual data, and computational advances now provide the means to engage in various types of automated text analysis that can enhance these projects. Understanding what analysis techniques are available and where they can appropriately be applied is an important first step to beginning a text analysis project.
This hands-on approach to text analysis will give a quick overview of small- and large-scale text-based projects before addressing strategies for organizing and conducting text analysis projects. Tools for data collection, parsing and eventual analysis will be introduced and demonstrated. The workshop will focus on acquiring and preparing text sources for small-scale projects and text-based visualizations, but many of the techniques will be useful for larger projects as well. For this introduction, the focus will primarily be on using Graphical User Interface (GUI) tools like Microsoft Excel and Google Refine, instead of programming languages and command line approaches.
Mapping and GIS
ArcGIS Online (AGOL) is a companion to the ArcGIS client that allows members of a group to store and share spatial data online and that can be used independently or in conjunction with the client. We'll discuss aspects of the AGOL organizational account, adding and accessing content, creating map and feature services, creating and sharing web maps and presentations, publishing web applications, and using analysis tools.
Web GIS Applications
Compare and contrast several products intended for geospatial visualization (e.g., a map to embed in a blog or PowerPoint, or for a poster session) and in some cases for GIS data analysis. (1) ArcGIS Online: Companion to the ArcGIS client that allows members of a group to store and share spatial data online, and that can be used independently or in conjunction with the client; (2) GeoCommons: both a repository for spatial data as well as an analysis and visualization tool; (3) Google Earth: emphasis on its features that are most applicable in an academic setting. See our schedule for another session on Google Fusion Tables.
Google Fusion Tables
Fall 2011 - Spring 2014
Introduction to the features of Google Fusion Tables, which include merging datasets, filtering and aggregating data, and visualizing data by creating online maps and graphs. For certain tasks, it can serve as an alternative to using statistical software such as Stata or GIS software such as ArcGIS.
Top 10 Dos and Don'ts for Charts and Graphs
Spring 2013 - Fall 2013
Simple charts and graphs can be incredibly effective at summarizing data. They are common and thus easier for a wide audience to understand. They are also easy to produce in the tools many people regularly use for other data analysis or project management work. With a few simple tips and tricks, you can avoid common missteps and make sure your charts are clear and easy to understand.
Data Visualization on the Web
Until recently, most data visualizations were created by installing statistics or visualization software onto our computers. In recent years, however, a number of web-based data visualization tools have been developed. These tools offer many advantages over downloaded software applications: visualizations can be created on PCs or Macs, groups can often collaborate in their creation, and the results can often be shared more easily. This workshop will give a quick overview of several web-based visualization tools, including Google Spreadsheets and Raw. Participants are encouraged to bring laptops to follow along with the demonstrations.
Data Visualization on the Web (Advanced)
Using Gephi for Network Analysis and Visualization
Networks (or graphs) are a compelling way of studying relationships between people, places, objects, ideas, etc. Generating network data and visualizations, however, can be an involved process requiring specialized tools. This workshop will explore some of the easier ways to produce, load, and visualize network data using Gephi, an open source, multi-platform network analysis and visualization application. Time will be available at the end of the workshop to discuss specific projects and test out different techniques with Gephi.