
P-value calculator — Z, t, Chi-Square & F tests.

Calculate the exact p-value from any Z, t, chi-square, or F test statistic. Choose one-tailed or two-tailed for Z and t tests, see significance at every standard α level, and compare against critical values, all in real time.

How it works

Example: a two-tailed Z-test with observed statistic Z = 2.33 (the statistic can be negative) gives p = 0.0198.

α = 0.001: not significant
α = 0.01: not significant
α = 0.05: significant
α = 0.1: significant

p = 0.0198. For a two-tailed Z-test, the result is statistically significant at α = 0.05. Reject the null hypothesis.

Critical values

Your statistic vs standard thresholds

Significance level | Critical value | Your stat | Reject H₀?
α = 0.1   | ±1.6449 | 2.33 | Yes ✓
α = 0.05  | ±1.9600 | 2.33 | Yes ✓
α = 0.01  | ±2.5758 | 2.33 | No
α = 0.001 | ±3.2905 | 2.33 | No

Critical values for a two-tailed test. Z uses standard normal quantiles.
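The table above can be reproduced from the standard normal quantile function. A minimal sketch using scipy (assumed available here, not part of the calculator itself); the critical value is the |Z| threshold that puts α/2 in each tail:

```python
from scipy import stats

def z_critical_two_tailed(alpha):
    """Two-tailed critical value: the |Z| threshold with alpha/2 in each tail."""
    return stats.norm.ppf(1 - alpha / 2)

z_obs = 2.33  # observed test statistic from the example above
for alpha in (0.1, 0.05, 0.01, 0.001):
    z_crit = z_critical_two_tailed(alpha)
    reject = abs(z_obs) > z_crit
    print(f"alpha = {alpha}: critical value = ±{z_crit:.4f}, reject H0: {reject}")
```

Rejecting H₀ whenever |Z| exceeds the critical value is exactly equivalent to rejecting whenever p < α; the table and the p-value are two views of the same decision.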

Interpretation guide

p = 0.0198 — what does that mean?

p = 0.0198 means: if the null hypothesis (H₀) were true, the probability of observing a test statistic at least as extreme as |2.33| by chance alone is 1.98%.

Common misconception: The p-value is NOT the probability that H₀ is true. It is the probability of the data (or more extreme data) given H₀ is true. A low p-value means the data are unlikely under H₀ — not that H₀ is false.

Field guide

What the p-value is and what people get wrong about it.

What is a p-value?

A p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one actually observed, assuming the null hypothesis (H₀) is true. It measures how surprising the data are under the null hypothesis — not how likely the null hypothesis is.

Formally: p = P(observing data this extreme | H₀ is true). A small p-value (conventionally p < 0.05) indicates that the observed data would be unlikely if H₀ were true, so we reject H₀. A large p-value means the data are not particularly surprising under H₀, so we fail to reject it.
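The formal definition translates directly into code. A sketch, assuming scipy, of a two-tailed Z-test p-value using the survival function (sf(z) = 1 − Φ(z)); it reproduces the p = 0.0198 result for Z = 2.33 used in the example above:

```python
from scipy import stats

def two_tailed_z_p(z):
    """p = P(|Z| >= |z|) under H0, where Z ~ N(0, 1)."""
    return 2 * stats.norm.sf(abs(z))

p = two_tailed_z_p(2.33)
print(f"p = {p:.4f}")  # reject H0 at alpha = 0.05, fail to reject at alpha = 0.01
```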

The most common misconceptions

The p-value is one of the most frequently misinterpreted statistics in all of science. Common wrong interpretations include:

  • "p = 0.03 means there is a 3% probability that H₀ is true." Wrong. The p-value says nothing about the probability of H₀. H₀ is either true or false; it doesn't have a probability. The p-value is a conditional probability: P(data | H₀), not P(H₀ | data).
  • "p = 0.03 means there is a 97% probability that H₁ is true." Also wrong, for the same reason. Determining P(H₁ | data) requires prior probabilities and Bayesian inference, a different framework entirely.
  • "p > 0.05 means H₀ is true." Failing to reject H₀ is not the same as accepting it. The data may simply be insufficient to detect a real effect.

One-tailed vs two-tailed tests

The choice of tail affects the p-value and should be made before looking at the data, based on the research question:

  • Two-tailed: H₁ is that the parameter differs from the H₀ value in either direction. p = P(|T| ≥ |t_obs|). Use this when you have no directional hypothesis. It is the more conservative and more common default.
  • Right-tailed: H₁ is that the parameter is greater than the H₀ value. p = P(T ≥ t_obs).
  • Left-tailed: H₁ is that the parameter is less than the H₀ value. p = P(T ≤ t_obs).

For a given test statistic, the one-tailed p-value is exactly half the two-tailed p-value (when the statistic is in the hypothesised direction). Switching from two-tailed to one-tailed to cross the 0.05 threshold after seeing the data is a form of p-hacking.
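The halving relationship can be checked directly. A sketch, again assuming scipy, for the same Z = 2.33 statistic:

```python
from scipy import stats

z = 2.33  # observed statistic, lying in the hypothesised (positive) direction

p_right = stats.norm.sf(z)         # right-tailed: P(Z >= z)
p_left = stats.norm.cdf(z)         # left-tailed:  P(Z <= z)
p_two = 2 * stats.norm.sf(abs(z))  # two-tailed:   P(|Z| >= |z|)

# The one-tailed p is exactly half the two-tailed p when the statistic
# lies in the hypothesised direction.
print(f"right-tailed p = {p_right:.4f}, two-tailed p = {p_two:.4f}")
```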

The four test families

Z-test

Used when the population standard deviation σ is known, or when the sample is large enough that the sample standard deviation is a reliable estimate (typically n ≥ 30). The test statistic follows a standard normal distribution N(0, 1) under H₀. Common applications: one-sample and two-sample proportion tests, large-sample means.

t-test

Used when σ is unknown and estimated from the sample, especially for small samples (n < 30). The test statistic follows a Student's t-distribution with (n − 1) degrees of freedom for a one-sample test, (n₁ + n₂ − 2) for an independent two-sample test, or (n − 1) for a paired t-test. The t-distribution has heavier tails than the normal, producing higher p-values for the same test statistic — a conservative correction for the extra uncertainty.
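The heavier tails are easy to demonstrate: the same statistic yields a larger p-value under the t-distribution than under the normal, and the gap shrinks as degrees of freedom grow. A sketch, assuming scipy and a made-up sample size of n = 11 (so df = 10):

```python
from scipy import stats

t_obs = 2.33
df = 10  # hypothetical one-sample t-test with n = 11

p_t = 2 * stats.t.sf(abs(t_obs), df)  # two-tailed t-test p-value
p_z = 2 * stats.norm.sf(abs(t_obs))   # same statistic under N(0, 1)

# Heavier tails -> larger (more conservative) p-value for the same statistic.
print(f"t (df={df}): p = {p_t:.4f}; normal: p = {p_z:.4f}")

# As df grows, the t-distribution converges to the standard normal.
p_t_large = 2 * stats.t.sf(abs(t_obs), 100000)
```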

Chi-Square (χ²) test

Used for categorical data. Common applications include goodness-of-fit tests (does observed frequency match a theoretical distribution?), tests of independence (are two categorical variables independent?), and tests of homogeneity. The χ² statistic is always non-negative; the p-value is always the upper-tail probability. Degrees of freedom depend on the specific test: for a goodness-of-fit test with k categories, df = k − 1; for a contingency table with r rows and c columns, df = (r−1)(c−1).
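The χ² p-value is always the upper-tail probability at the appropriate degrees of freedom. A sketch, assuming scipy and a hypothetical 3×4 contingency table with a made-up test statistic; it also checks the identity that a χ² variable with 1 df is the square of a standard normal:

```python
from scipy import stats

def chi2_p(stat, df):
    """Upper-tail p-value: P(X >= stat) for X ~ chi-square(df)."""
    return stats.chi2.sf(stat, df)

# Test of independence on a hypothetical 3x4 contingency table:
r, c = 3, 4
df = (r - 1) * (c - 1)  # = 6
p = chi2_p(14.2, df)    # 14.2 is a made-up chi-square statistic

# Sanity check: chi-square(1) is Z^2 for Z ~ N(0, 1), so its upper-tail
# probability at z^2 equals the two-tailed normal p-value for z.
z = 1.96
p_chi2 = chi2_p(z**2, 1)
p_norm = 2 * stats.norm.sf(z)
```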

F-test

Used to compare variances or model fit. ANOVA uses the F-test to compare group means by taking the ratio of between-group variance to within-group variance. Regression uses it to test overall model significance. The F statistic is always non-negative; the p-value is the upper-tail probability. Two degrees-of-freedom parameters are required: df₁ (numerator) and df₂ (denominator), which in ANOVA correspond to (k−1) and (N−k) respectively.
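A sketch of the ANOVA degrees-of-freedom bookkeeping, assuming scipy and made-up numbers (k = 3 groups, N = 30 total observations). It also checks the identity that an F statistic with df₁ = 1 is a squared t statistic, which links the F-test back to the two-tailed t-test:

```python
from scipy import stats

k, N = 3, 30  # hypothetical: 3 groups, 30 total observations
df1 = k - 1   # numerator df (between groups) = 2
df2 = N - k   # denominator df (within groups) = 27

f_obs = 4.5                      # made-up F statistic
p = stats.f.sf(f_obs, df1, df2)  # upper-tail probability

# Identity: F(1, d) = t(d)^2, so the upper-tail F p-value at t^2
# equals the two-tailed t-test p-value at t.
t_val = 2.33
p_f = stats.f.sf(t_val**2, 1, df2)
p_t = 2 * stats.t.sf(t_val, df2)
```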

The 0.05 threshold — why it is and isn't magic

The α = 0.05 threshold was popularised by R.A. Fisher in the 1920s as a convenient rule of thumb, not a fundamental truth. Over time it has become a de facto publishing gate in many fields, causing significant problems:

  • Studies with p = 0.049 and p = 0.051 are treated radically differently despite being statistically indistinguishable.
  • Publication bias toward p < 0.05 results inflates the literature with false positives.
  • Effect size and confidence intervals are often more informative than the binary significant/not-significant classification.

Modern statistical practice increasingly recommends reporting exact p-values, effect sizes, and confidence intervals rather than binary significance labels.