22nd Dynamic Econometrics Conference - 16–17 September 2020, Online
Customized Markdown and .docx tables using listtab and docxtab
Abstract: Statisticians make their living producing tables (and plots).
I present an update of a general family of methods for making customized tables called the DCRIL path (decode, characterize, reshape, insert, list), with customized table cells (using the sdecode package), customized column attributes (using the chardef package), customized column labels (using the xrewide package), and/or customized inserted gap-row labels (using the insingap package), and listing these tables to automatically generated documents. This demonstration uses the listtab package to list Markdown tables for browser-ready HTML documents, which Stata users like to generate, and the docxtab package to list .docx tables for printer-ready .docx documents, which our superiors like us to generate.
Roger B. Newson
King's College London
Multiply imputing informatively censored time-to-event data
Time-to-event data, such as overall survival in a cancer clinical trial, are commonly right-censored, and this censoring is commonly assumed to be noninformative. While noninformative censoring is plausible when censoring is due to end of study, it is less plausible when censoring is due to loss to follow-up. Sensitivity analyses for departures from the noninformative censoring assumption can be performed using multiple imputation under the Cox model. These have been implemented in R but are not commonly used. We propose a new implementation in Stata. Our existing stsurvimpute command (on SSC) imputes right-censored data under noninformative censoring, using a flexible parametric survival model fit by stpm2. We extend this to allow a sensitivity parameter gamma, representing the log of the hazard ratio in censored individuals versus comparable uncensored individuals (the informative censoring hazard ratio, ICHR). The sensitivity parameter can vary between individuals, and imputed data can be recensored at the end-of-study time. Because the mi suite does not allow imputed variables to be stset, we create an imputed dataset in ice format and analyze it using mim. In practice, sensitivity analysis computes the treatment effect for a range of scientifically plausible values of gamma. We illustrate the approach using a cancer clinical trial.
References: Jackson, D., I. R. White, S. Seaman, H. Evans, K. Baisley, and J. Carpenter. 2014. Relaxing the independent censoring assumption in the Cox proportional hazards model using multiple imputation. Statistics in Medicine 33: 4681–4694. https://CRAN.R-project.org/package=InformativeCensoring
Contributor: Patrick Royston, MRC Clinical Trials Unit at UCL
Ian R. White
MRC Clinical Trials Unit at UCL
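The role of the sensitivity parameter gamma in the abstract above can be illustrated with a minimal Python sketch. It is not the stsurvimpute implementation: it swaps the flexible parametric model for a constant (exponential) hazard, which is an assumption made purely to keep the example short, and the function name `impute_event_time` and all numeric values are hypothetical. A full multiple-imputation procedure would also draw the model parameters afresh for each imputation, which is omitted here.

```python
import math
import random

def impute_event_time(censor_time, base_hazard, gamma, rng, end_of_study=None):
    """Impute an event time for one right-censored individual.

    Assumes a constant hazard (exponential model). After censoring, the
    hazard is multiplied by exp(gamma), the informative censoring hazard
    ratio. gamma = 0 corresponds to noninformative censoring.
    Returns (time, event_indicator).
    """
    hazard_after_censoring = base_hazard * math.exp(gamma)
    # Memoryless property: the residual time beyond the censoring time is
    # exponential with the (possibly inflated) hazard.
    event_time = censor_time + rng.expovariate(hazard_after_censoring)
    if end_of_study is not None and event_time > end_of_study:
        return end_of_study, 0  # recensored at the end-of-study time
    return event_time, 1

rng = random.Random(2023)
for gamma in (0.0, math.log(2.0)):  # hypothetical sensitivity values
    imputed = [impute_event_time(5.0, 0.1, gamma, rng, end_of_study=30.0)
               for _ in range(5)]
    print(f"gamma={gamma:.3f}:", [(round(t, 1), d) for t, d in imputed])
```

Repeating the analysis over a grid of gamma values, as the abstract describes, then shows how far the treatment-effect estimate moves as censoring becomes more informative.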
Influence analysis with panel data using Stata
The presence of units that possess extreme values in the dependent variable and independent variables (for example, vertical outliers, good and bad leverage points) has the potential to severely bias least-squares (LS) estimates, such as regression coefficients and standard errors. Diagnostic plots (such as leverage-versus-squared residual plots) and measures of overall influence (for example, Cook's distance) are usually used to detect such anomalies, but two problems arise from their use. First, available commands for diagnostic plots are built for cross-sectional data, and some data manipulation is necessary for panel data. Second, Cook-like distances may fail to flag multiple anomalous cases in the data because they do not account for pairwise influence of observations (Atkinson 1993; Chatterjee and Hadi 1988; Rousseeuw 1991; Rousseeuw and Van Zomeren 1990; Lawrance 1995). I overcome these limits as follows. First, I formalize statistical measures to quantify the degree of leverage and outlyingness of units in a panel-data framework to produce diagnostic plots suitable for panel data. Second, I build on Lawrance's (1995) pairwise approach by proposing measures for joint and conditional influence suitable for panel-data models with fixed effects. I develop a method to visually detect anomalous units in a panel dataset, identify their types, and investigate the effect of these units on LS estimates and on other units’ influence. I propose two community-contributed commands in Stata to implement this method. xtlvr2plot produces a leverage-versus-residual plot suitable for panel data and a summary table with the list of detected anomalous units and their type. xtinfluence calculates the joint and conditional influence and effects of pairs of units and generates network-style plots (the command offers a choice between a scatterplot and a heat plot). JEL codes: C13, C15, C23.
Institute for Analytics and Data Science and University of Essex
A suite of programs for the design, development, and validation of clinical prediction models
An ever-increasing number of research questions focuses on the development and validation of clinical prediction models to inform individual diagnosis and prognosis in healthcare. These models predict outcome values (for example, pain intensity) or outcome risks (for example, five-year mortality risk) in individuals from a target population (for example, pregnant women; cancer patients). Development and validation of such models is a complex process, with a myriad of statistical methods, validation measures, and reporting options. It is therefore not surprising that there is considerable evidence of poor methodology in such studies. In this presentation, I will introduce a suite of ancillary software packages with the prefix “pm”. The pm-suite of packages aims to facilitate the implementation of methodology for building new models, validating existing models and transparent reporting. All packages are in line with the recommendations of the TRIPOD guidelines, which provide a benchmark for the reporting of prediction models. I will showcase a selection of packages to aid in each stage of the life cycle of a prediction model, from the initial design (for example, sample-size calculation using pmsampsize and pmvalsampsize), to development and internal validation (for example, calculating model performance using pmstats), external validation (for example, flexible calibration plots of performance in new patients using pmcalplot), and model updating (for example, comparing updating methods using pmupdate). Through an illustrative example, I will demonstrate how these packages allow researchers to perform common prediction modeling tasks quickly and easily while standardizing methodology.
Dr. Joie Ensor
University of Birmingham
Bayesian model averaging
Abstract: Model uncertainty accompanies many data analyses.
Stata's new bma suite, which performs Bayesian model averaging (BMA), helps address this uncertainty in the context of linear regression. Which predictors are important given the observed data? Which models are more plausible? How do predictors relate to each other across different models? BMA can answer these questions and more. BMA uses Bayes' theorem to aggregate the results across multiple candidate models to account for model uncertainty during inference and prediction in a principled and universal way. In my presentation, I will describe the basics of BMA and demonstrate it with the bma suite. I will also show how BMA can become a useful tool for your regression analysis, Bayesian or not!
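The aggregation step described above can be sketched in a few lines of Python. This is not the bma suite: it assumes the common BIC approximation, in which a model's posterior probability is proportional to its prior probability times exp(-BIC/2), and the BIC values and predictor names below are made-up numbers for illustration only.

```python
import math

# Hypothetical BIC values for four candidate linear models (made-up numbers).
# Each key is the tuple of predictors included in that model.
candidate_bics = {
    (): 110.0,
    ("x1",): 100.0,
    ("x2",): 108.0,
    ("x1", "x2"): 102.0,
}

def posterior_model_probs(bics, priors=None):
    """Approximate posterior model probabilities from BIC values.

    Assumes p(model | data) is proportional to prior * exp(-BIC / 2).
    With priors=None, every model gets the same prior probability.
    """
    if priors is None:
        priors = {model: 1.0 / len(bics) for model in bics}
    best = min(bics.values())  # subtract the minimum for numerical stability
    weights = {m: priors[m] * math.exp(-(bic - best) / 2.0)
               for m, bic in bics.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

def inclusion_probs(model_probs):
    """Posterior inclusion probability: total mass of models containing a predictor."""
    predictors = sorted({p for model in model_probs for p in model})
    return {p: sum(prob for model, prob in model_probs.items() if p in model)
            for p in predictors}

model_probs = posterior_model_probs(candidate_bics)
pip = inclusion_probs(model_probs)
for model, prob in sorted(model_probs.items(), key=lambda kv: -kv[1]):
    print(model, round(prob, 3))
print({p: round(v, 3) for p, v in pip.items()})
```

The model probabilities address "which models are more plausible", and the inclusion probabilities address "which predictors are important".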
Prioritizing clinically important outcomes using the win ratio
Abstract: The win ratio is a statistical method used for analyzing composite outcomes in clinical trials.
Composite outcomes are composed of two or more distinct “component” events (for example, heart attacks, death) and are often analyzed using time-to-first event methods ignoring the relative importance of the component events. When using the win ratio, component events are instead placed into a hierarchy from most to least important; more important components can then be prioritized over less important outcomes (for example, death, followed by myocardial infarction). The method works by first placing patients into pairs. Within each pair, one evaluates the components in order of priority starting with the most important until one of the pair is determined to have a better outcome than the other. A major advantage of the approach is its flexibility: one can include in the hierarchy outcomes of different types (for example, time-to-event, continuous, binary, ordinal, and repeat events). This can have major benefits, for example by allowing assessment of quality of life or symptom scores to be included as part of the outcome. This is particularly helpful in disease areas where recruiting enough patients for a conventional outcomes trial is unfeasible. The win-ratio approach is increasingly popular, but a barrier to more widespread adoption is a lack of good statistical software. The calculation of sample sizes is also complex and usually requires simulation. We present winratiotest, the first package to implement win-ratio analyses in Stata. The command is flexible and user-friendly. Included in the package is the first software (we know of) that can calculate the sample size for win-ratio-based trials without requiring simulation. Contributors: Tim Collier Joan Pedro Ferreira London School of Hygiene and Tropical Medicine
London School of Hygiene and Tropical Medicine
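The pairwise procedure described in the win-ratio abstract above can be sketched in Python as follows. This is not the winratiotest command: it assumes the unmatched variant in which every treated patient is compared with every control patient, it ignores censoring and follow-up time, and the component names (`died`, `score`) and patient records are hypothetical.

```python
from itertools import product

def compare_pair(treated, control, hierarchy):
    """Walk the hierarchy from most to least important component.

    Returns 1 if the treated patient wins, -1 if the control patient wins,
    and 0 if the pair is tied on every component.
    """
    for name, higher_is_better in hierarchy:
        a, b = treated[name], control[name]
        if a == b:
            continue  # tied on this component: move to the next one
        treated_better = (a > b) if higher_is_better else (a < b)
        return 1 if treated_better else -1
    return 0

def win_ratio(treated_group, control_group, hierarchy):
    """Count wins, losses, and ties over all treated-control pairs."""
    wins = losses = ties = 0
    for t, c in product(treated_group, control_group):
        result = compare_pair(t, c, hierarchy)
        if result > 0:
            wins += 1
        elif result < 0:
            losses += 1
        else:
            ties += 1
    return wins, losses, ties, wins / losses

# Hypothetical data: death is prioritized over a quality-of-life score.
hierarchy = [("died", False), ("score", True)]
treated_group = [{"died": 0, "score": 70}, {"died": 1, "score": 40}]
control_group = [{"died": 0, "score": 60}, {"died": 1, "score": 50},
                 {"died": 0, "score": 70}]

wins, losses, ties, ratio = win_ratio(treated_group, control_group, hierarchy)
print(wins, losses, ties, round(ratio, 3))
```

Because each component only needs an ordering, the same loop accommodates binary, continuous, or ordinal outcomes in one hierarchy, which is the flexibility the abstract highlights.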
Object-oriented programming in Mata
Abstract: Object-oriented programming (OOP) is a programming paradigm that is ubiquitous in today's landscape of programming languages.
OOP code proceeds by first defining separate entities (classes) and their relationships, and then letting them communicate with each other. Mata, Stata's matrix language, has such OOP capabilities. Compared with some other object-oriented programming languages, like Java or C++, Mata offers a lighter implementation, but it strikes a nice balance between feature availability and language complexity. This presentation explores OOP features in Mata by describing the code behind dtms, a community-contributed package for discrete-time multistate model estimation. Estimation in dtms proceeds in several steps, where each step can nest multiple results of the next level, thus building up a treelike structure of results. The presentation explains how this treelike structure is implemented in Mata using OOP, and what the benefits of using OOP for this task are. These include easier code maintenance via a more transparent code structure, shorter coding time, and an easier implementation of efficient calculations. The presentation will first provide simple examples of useful classes; for example, a class that represents a Stata matrix in Mata, or a class that can grab, hold, and restore Stata e()-results. More complex relationships among classes will then be explored in the context of the treelike results structure of dtms. While topics covered will include such technical-sounding concepts as class composition, self-threading code, inheritance, and polymorphism, an effort will be made to link these concepts to tasks that are relevant to Stata users who have already gained or are interested in gaining an initial proficiency in Mata.
Daniel C. Schneider
Max Planck Institute for Demographic Research
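The concepts named in the abstract above (composition, inheritance, polymorphism, and a treelike structure of results) can be illustrated with a short Python sketch. It is written in Python rather than Mata, it does not reproduce the dtms code, and the class names `ResultNode` and `EstimateNode` and the labels are hypothetical. Returning `self` from `add` is shown as one way to allow chained calls; whether this matches what the talk means by self-threading code is an assumption.

```python
class ResultNode:
    """A node in a tree of results; each node can nest further results."""

    def __init__(self, label):
        self.label = label
        self.children = []  # composition: a node holds other nodes

    def add(self, child):
        self.children.append(child)
        return self  # returning self allows chained calls

    def summary(self):
        return self.label

    def render(self, depth=0):
        """Return the subtree as indented lines, one per node."""
        lines = ["  " * depth + self.summary()]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


class EstimateNode(ResultNode):
    """A leaf-style node that also carries a numeric result (inheritance)."""

    def __init__(self, label, value):
        super().__init__(label)
        self.value = value

    def summary(self):  # polymorphism: render() calls the right summary()
        return f"{self.label} = {self.value:.3f}"


tree = (
    ResultNode("analysis")
    .add(ResultNode("step 1").add(EstimateNode("estimate a", 0.25)))
    .add(EstimateNode("estimate b", 12.5))
)
print("\n".join(tree.render()))
```

The `render` method never needs to know which kind of node it is visiting, which is the maintenance benefit that polymorphism brings to a nested results structure.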
A review of machine learning commands in Stata: Performance and usability evaluation
Abstract: This presentation provides a comprehensive survey reviewing machine learning (ML) commands in Stata.
I systematically categorize and summarize the available ML commands in Stata and evaluate their performance and usability for different tasks such as classification, regression, clustering, and dimension reduction. I also provide examples of how to use these commands with real-world datasets and compare their performance. This review aims to help researchers and practitioners choose appropriate ML methods and related Stata tools for their specific research questions and datasets, and to improve the efficiency and reproducibility of ML analyses using Stata. I conclude by discussing some limitations and future directions for ML research in Stata.
On the shoulders of giants: Writing wrapper commands in Stata
Abstract: For repeated tasks, it is convenient to use commands with simple syntax that carry out more complicated tasks under the hood.
These can be data management and visualization tasks or statistical analyses. Many of these tasks are variations or special cases of more versatile approaches. Instead of reinventing the wheel, wrapper commands build on the existing capabilities by “wrapping” around other commands. For example, certain types of graphs might require substantial effort when building them from scratch using Stata's graph twoway commands, but this process can be automated with a dedicated command. Similarly, many estimators for specific models are special cases of more general estimation techniques, such as maximum likelihood or generalized method of moments estimators. A wrapper command can be used to translate relatively simple syntax into the more complex syntax of Stata's ml or gmm commands, or even directly into the underlying optimize() or moptimize() Mata functions. Many official Stata commands can be regarded as wrapper commands, and often there is a hierarchical wrapper structure with multiple layers. For example, most commands for mixed-effects estimation of particular models are wrappers for the general meglm command, which itself just wraps around the undocumented _me_estimate command, which then calls gsem, which in turn initiates the estimation with the ml package. The main purpose of the higher-layer wrappers is typically syntax parsing. With every layer, the initially simple syntax is translated into the more general syntax of the lower-layer command, but the user only needs to be concerned with the basic syntax of the top-layer command. Similarly, community-contributed commands often wrap around official or other community-contributed commands. They may even wrap around packages written for other programming environments, such as Python. In this presentation, I discuss different types of wrapper commands and focus on practical aspects of their implementation. I illustrate these ideas with two of my own commands.
The new spxtivdfreg wrapper adds a spatial dimension to the xtivdfreg command (Kripfganz and Sarafidis 2021) for defactored instrumental-variables estimation of large panel-data models with common factors. The xtdpdgmmfe wrapper provides a simplified syntax for the GMM estimation of linear dynamic fixed-effects panel-data models with the xtdpdgmm command.
University of Exeter
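The wrapper idea from the abstract above (simple syntax on top, a general optimizer underneath) can be sketched in Python. This is only an analogy to Stata's layered commands: `maximize` is a hypothetical general-purpose routine (a golden-section search), and `fit_exponential_rate` is a hypothetical wrapper that builds the log likelihood and search bounds so the caller does not have to.

```python
import math

def maximize(objective, lower, upper, tol=1e-7):
    """General layer: golden-section search for a unimodal objective on [lower, upper]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lower, upper
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if objective(c) > objective(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2.0

def fit_exponential_rate(durations):
    """Wrapper layer: takes only the data and translates it into an optimizer call."""
    n, total = len(durations), sum(durations)

    def loglik(rate):
        # Log likelihood of an exponential model with the given rate.
        return n * math.log(rate) - rate * total

    return maximize(loglik, 1e-9, 10.0 * n / total)

durations = [2.0, 4.0, 6.0]  # hypothetical data
print(round(fit_exponential_rate(durations), 4))
```

The user of the wrapper never sees the objective function or the bounds, which mirrors how the top-layer command hides the more general syntax of the layers below it.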
Gigs package: New egen extensions for international newborn and child growth standards
Abstract: Children’s growth status is an important measure commonly used as a proxy indicator of advancements in a country’s health, human capital, and economic development.
Understanding how and why child growth patterns have changed is necessary for characterizing global health inequalities. Sustainable Development Goal 3.2 aims to reduce preventable newborn deaths to at least as low as 12 per 1,000 live births and child deaths to at least as low as 25 per 1,000 live births (WHO/UNICEF, 2019). However, large gaps remain in achieving these goals: currently 54 and 64 (of 194) countries will miss the targets for child (<5 years) and neonatal (
London School of Hygiene and Tropical Medicine
Plot suite: Fast graphing commands for very large datasets
Abstract: This presentation showcases the functionality of the new “plot suite” of graphing commands.
The suite excels in visualizing very large datasets, enabling users to produce a variety of highly customizable plots in a fraction of the time required by Stata's native graphing commands.
Melbourne Institute of Applied Economic and Social Research
pystacked and ddml: Machine learning for prediction and causal inference in Stata
Abstract: pystacked implements stacked generalization (Wolpert 1992) for regression and binary classification via Python’s scikit-learn.
Stacking is an ensemble method that combines multiple supervised machine learners (the “base” or “level-0” learners) into a single learner. The currently supported base learners include regularized regression (lasso, ridge, elastic net), random forest, gradient boosted trees, support vector machines, and feed-forward neural nets (multilayer perceptron). pystacked can also be used to fit a single base learner and thus provides an easy-to-use API for scikit-learn’s machine learning algorithms. ddml implements algorithms for causal inference aided by supervised machine learning as proposed in “Double/debiased machine learning for treatment and structural parameters” (Econometrics Journal 2018). Five different models are supported, allowing for binary or continuous treatment variables and endogeneity in the presence of high-dimensional controls and/or instrumental variables. ddml is compatible with many existing supervised machine learning programs in Stata, and in particular has integrated support for pystacked, making it straightforward to use machine learner ensemble methods in causal inference applications. Contributors: Achim Ahrens, ETH Zürich; Christian B. Hansen and Thomas Wiemann, University of Chicago
Mark E. Schaffer
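The stacking idea in the abstract above can be shown with a small standard-library Python sketch. It does not use pystacked or scikit-learn: the two base learners (a mean predictor and a one-variable least-squares line), the leave-one-out cross-validation, and the unconstrained least-squares combination of the two learners are all simplifying assumptions, and the data are hypothetical.

```python
def fit_mean(xs, ys):
    """Base learner 1: always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Base learner 2: simple least-squares line in one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cross_val_predictions(fit, xs, ys):
    """Leave-one-out predictions: each point is predicted by a model that never saw it."""
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(model(xs[i]))
    return preds

def stack(xs, ys, learners):
    """Combine exactly two base learners with least-squares weights.

    The weights are estimated on the cross-validated predictions, then the
    base learners are refit on all data for the final stacked predictor.
    """
    p, q = (cross_val_predictions(fit, xs, ys) for fit in learners)
    a = sum(u * u for u in p)
    b = sum(u * v for u, v in zip(p, q))
    d = sum(v * v for v in q)
    e = sum(u * y for u, y in zip(p, ys))
    f = sum(v * y for v, y in zip(q, ys))
    det = a * d - b * b  # solve the 2x2 normal equations directly
    w1 = (e * d - b * f) / det
    w2 = (a * f - b * e) / det
    models = [fit(xs, ys) for fit in learners]
    return (w1, w2), lambda x: w1 * models[0](x) + w2 * models[1](x)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data
ys = [1.2, 2.9, 5.1, 7.2, 8.8, 11.1]
weights, stacked = stack(xs, ys, [fit_mean, fit_line])
print([round(w, 3) for w in weights], round(stacked(6.0), 3))
```

Estimating the weights on out-of-sample predictions is what keeps the combination from simply rewarding the base learner that overfits the most.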
Fitting the Skellam distribution in Stata
Abstract: The Skellam distribution is a discrete probability distribution related to the difference between two independent Poisson-distributed random variables.
It has been used in a variety of contexts, including sports or supply and demand imbalances in shared transportation. To the best of our knowledge, Stata does not support the Skellam distribution or the Skellam regression. In this presentation, I plan to show how to fit the parameters of a Skellam distribution and Skellam regression using Mata’s optimize function. The optimization problem is then packaged into a basic Stata command that I plan to describe.
Université libre de Bruxelles
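The fitting problem described in the Skellam abstract above can be sketched in Python. This is not the Stata command from the talk: the probability mass function is computed as a truncated convolution of two Poisson distributions, the starting values come from the method of moments (mean = mu1 - mu2, variance = mu1 + mu2), and the hypothetical `fit_ml` routine uses a crude pattern search as a stand-in for a proper optimizer such as Mata's optimize(). The data are made up.

```python
import math

def poisson_pmf(n, mu):
    return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))

def skellam_pmf(k, mu1, mu2, terms=100):
    """P(N1 - N2 = k) for independent N1 ~ Poisson(mu1) and N2 ~ Poisson(mu2).

    Sums P(N1 = n + k) * P(N2 = n) over n, truncated after `terms` terms.
    """
    start = max(0, -k)
    return sum(poisson_pmf(n + k, mu1) * poisson_pmf(n, mu2)
               for n in range(start, start + terms))

def moment_estimates(data):
    """Method of moments: mean = mu1 - mu2 and variance = mu1 + mu2."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return (var + mean) / 2.0, (var - mean) / 2.0

def log_likelihood(data, mu1, mu2):
    return sum(math.log(skellam_pmf(k, mu1, mu2)) for k in data)

def fit_ml(data, step=0.5, tol=1e-5):
    """Crude pattern search for the maximum likelihood estimates."""
    mu1, mu2 = (max(m, 0.1) for m in moment_estimates(data))
    best = log_likelihood(data, mu1, mu2)
    while step > tol:
        improved = False
        for d1, d2 in ((step, 0), (-step, 0), (0, step), (0, -step)):
            c1, c2 = mu1 + d1, mu2 + d2
            if c1 <= 0 or c2 <= 0:
                continue  # both Poisson means must stay positive
            ll = log_likelihood(data, c1, c2)
            if ll > best:
                mu1, mu2, best, improved = c1, c2, ll, True
        if not improved:
            step /= 2.0
    return mu1, mu2, best

data = [1, 0, -1, 2, 0, 3, -2, 1, 1, 0]  # hypothetical score differences
print("moments:", moment_estimates(data))
print("ml:", tuple(round(v, 4) for v in fit_ml(data)))
```

A Skellam regression would replace the two constants with functions of covariates, for example log-linear models for mu1 and mu2, and maximize the same log likelihood.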
A short report on making Stata secure and adding metadata in a new data platform
Abstract: The presentation has two parts. A version of the first part was presented at the 2022 Northern European Stata Conference.
Part 1. Securing Stata in a secure environment. Data access and logging. At CRN, we develop a secure environment for using Stata. A short description of this work is given describing the data access and logging of data extraction (JDBC + Java plugins) and Stata commands. Part 2. Metadata using characteristics. In the new solution, metadata is automatically attached to Stata .dta characteristics when users fetch data from the data warehouse. The implementation is described, along with some small utility programs to use metadata, and examples of use are presented.
Cancer Registry of Norway
Facilities for optimizing and designing multiarm multistage (MAMS) randomized controlled trials with binary outcomes
Abstract: In this presentation, we introduce two Stata commands, nstagebin and nstagebinopt, which can be used to facilitate the design of multiarm multistage (MAMS) trials with binary outcomes.
MAMS designs are a class of efficient and adaptive randomized clinical trials that have successfully been used in many disease areas, including cancer, TB, maternal health, COVID-19, and surgery. The nstagebinopt command finds a class of efficient “admissible” designs based on an optimality criterion using a systematic search procedure. The nstagebin command calculates the stagewise sample sizes, trial timelines, and the overall operating characteristics of MAMS designs with binary outcomes. Both programs allow the use of Dunnett's correction to account for multiple testing. We also use the ROSSINI 2 MAMS design, an ongoing MAMS trial in surgical wound infection, to illustrate the capabilities of both programs. The new Stata commands facilitate the design of MAMS trials with binary outcomes where more than one research question can be addressed under one protocol. Reference: Choodari-Oskooei, B., D. J. Bratton, and M. Parmar. 2023. Facilities for optimizing and designing multiarm multistage (MAMS) randomised controlled trials with binary outcomes. Stata Journal. Under review. Contributors: Daniel J. Bratton, GlaxoSmithKline; Mahesh KB Parmar, University College London
University College London
The logistics organizer for the 2023 UK Stata Conference is Timberlake Consultants, the Stata distributor to the United Kingdom and Ireland, France, Spain, Portugal, the Middle East and North Africa, Brazil, and Poland.